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Abstract 


Linear  signal  transforms  (such  as  the  discrete  Fourier  transform)  are  frequently  used  in  digital  signal 
processing  and  related  fields.  Algorithms  for  computing  linear  transforms  are  well-understood  and 
typically  have  a  high  degree  of  regularity  and  parallelism,  making  them  well-suited  for  implemen¬ 
tation  as  sequential  datapaths  on  field-programmable  gate  arrays  or  application-specific  integrated 
circuits.  Nonetheless,  transforms  are  difficult  to  implement  due  to  the  large  number  of  algorithmic 
options  available  and  ways  that  algorithms  can  be  mapped  to  sequential  datapaths.  Further,  the 
best  choices  depend  heavily  on  the  resource  budget  and  performance  goals  of  the  target  application. 
Thus,  it  is  difficult  for  a  designer  to  determine  which  set  of  options  will  best  meet  a  given  set  of 
requirements. 

This  thesis  proposes  the  Spiral  hardware  generation  framework,  a  hardware  compilation  and  op¬ 
timization  tool  based  on  a  mathematical  formula  language.  A  formula  written  in  this  language  spec¬ 
ifies  a  particular  transform  algorithm  executed  using  a  particular  sequential  datapath.  This  language 
allows  high-level  representation  of  sequential  hardware  structures  that  reuse  datapath  elements  mul¬ 
tiple  times  in  the  computation  of  a  transform.  The  language  accomplishes  this  by  providing  a  formal 
connection  between  structure  in  the  algorithm  and  sequential  reuse  in  the  datapath.  This  proposed 
language  drives  a  hardware  compilation  system  that  takes  as  input  a  problem  specification  with  di¬ 
rectives  that  define  characteristics  of  the  desired  datapath;  the  system  then  automatically  generates 
an  algorithm,  maps  the  algorithm  to  a  datapath,  and  outputs  synthesizable  register  transfer  level 
Verilog. 

This  thesis  evaluates  the  generated  designs  when  synthesized  for  field-programmable  gate  array 
or  application-specific  integrated  circuit.  Its  evaluations  consider  designs  across  multiple  trans¬ 
forms,  datatypes,  and  design  goals,  and  its  results  show  that  Spiral  is  able  to  automatically  provide 
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a  wide  tradeoff  between  cost  (e.g.,  area,  power)  and  performance.  This  tradeoff  space  compares 
well  with  existing  benchmarks,  but  allows  the  designer  much  more  flexibility  to  find  the  design  best 
suited  to  his  or  her  needs. 
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Chapter  1 


Introduction 

1.1  Motivation 

Linear  signal  transforms  such  as  the  discrete  Fourier  transform  or  discrete  cosine  transform  are 
ubiquitous  in  digital  signal  processing  (DSP),  scientific  computing,  and  communication  applica¬ 
tions.  Algorithms  for  computing  linear  transforms  are  highly  structured  and  regular,  and  they  ex¬ 
hibit  large  amounts  of  parallelism.  For  these  reasons,  algorithms  in  this  domain  are  well-suited 
for  hardware  implementation  as  a  sequential  datapath  on  a  field-programmable  gate  array  (FPGA) 
or  application-specific  integrated  circuit  (ASIC).  This  inherent  algorithmic  structure  leads  to  a  large 
amount  of  freedom  in  the  way  a  given  algorithm  maps  to  a  datapath.  Coupled  with  the  fact  that  mul¬ 
tiple  algorithmic  options  are  typically  available  for  a  given  transform,  this  means  that  the  combined 
algorithm/datapath  space  is  far  too  large  for  a  designer  to  explore  easily. 

Further,  the  algorithmic  and  datapath  options  are  mutually  restricting;  the  best  choice  in  one 
domain  depends  on  which  choices  are  made  in  the  other.  However,  designers  typically  do  not  have 
the  tools  available  to  reason  about  both  sets  of  options  at  once.  Moreover,  the  best  choices  for 
both  algorithm  and  datapath  structure  are  highly  dependent  on  the  context:  an  application’s  specific 
performance  goals  and  cost  requirements. 

Typically,  a  designer  attempting  to  build  a  customized  hardware  implementation  of  a  linear  sig¬ 
nal  transform  will  alternate  between:  (a)  exploring  algorithms  to  be  executed  on  a  specific  type  of 
datapath,  and  (b)  exploring  different  datapaths  to  execute  a  specific  type  of  algorithm.  Either  way, 
the  designer  reasons  about  one  portion  of  the  problem  while  keeping  an  implicit  mapping  between 
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1.2.  Thesis  Overview 
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algorithm  and  datapath  in  his  or  her  mind.  This  human-driven  exploration  process  is  difficult  and 
slow,  and  few  designers  have  the  required  experience  with  both  algorithm  and  hardware  design  do¬ 
mains  needed  to  arrive  at  the  options  best  suited  for  their  application  requirements.  Often,  designers 
resort  to  using  solutions  from  IP  vendors  that  provide  common  designs  across  a  few  cost/perfor¬ 
mance  tradeoff  points. 

1.2  Thesis  Overview 

This  thesis  addresses  these  problems  by  introducing  a  high-level  mathematical  framework  for  auto¬ 
matically  generating  customized  hardware  implementations  of  linear  DSP  transforms.  This  system 
covers  a  wide  range  of  algorithmic  and  datapath  options  and  frees  the  designer  from  the  difficult 
process  of  manually  performing  algorithmic  and  datapath  exploration.  The  designs  produced  cover 
a  wide  cost/performance  tradeoff  space  and  are  competitive  with  good  hand-designed  implementa¬ 
tions.  The  system’s  automated  nature  allows  easy  exploration  of  a  wide  space  of  options  without 
sacrificing  the  quality  of  the  resulting  design. 

The  key  to  the  proposed  system  is  a  domain-specific  formula-based  language  for  specifying 
transform  algorithms  and  sequential  datapaths  on  which  to  execute  them.  This  language  extends  the 
language  in  [1]  to  include  datapath  concepts  such  as  parallelism  and  explicit  datapath  reuse,  which 
the  designer  specifies  at  a  high  level  of  abstraction.  A  single  formula  in  this  extended  language 
specifies  one  particular  algorithm  and  one  specific  sequential  datapath  on  which  to  execute  it. 

This  mathematical  language  drives  a  full  compilation  system  that  begins  with  a  problem  speci¬ 
fication  and  produces  a  synthesizable  register-transfer  level  Verilog  description.  This  system  begins 
by  taking  as  input  a  linear  signal  transform  of  a  given  size  as  well  as  high-level  hardware  direc¬ 
tives  that  describe  qualities  of  the  desired  datapath.  Then,  the  system  utilizes  a  base  of  algorithmic 
knowledge  to  construct  a  formula  that  specifies  a  transform  algorithm.  Next,  the  formula  is  rewritten 
(based  on  user-provided  directives)  to  produce  a  hardware  formula  that  explicitly  specifies  a  datap¬ 
ath  with  sequential  reuse  of  hardware  structures.  Lastly,  the  system  maps  the  hardware  formula  to  a 
corresponding  synthesizable  register-transfer  level  Verilog  description. 

An  important  challenge  in  mapping  from  formula  to  sequential  datapath  lies  in  the  hardware 
implementation  of  permutations  (fixed  reorderings  of  data,  commonly  occurring  in  transform  algo- 
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rithms).  Permutations  are  difficult  to  perform  on  streaming  data  because  the  system  must  buffer  and 
reorder  data  elements  distributed  over  a  number  of  cycles.  This  thesis  includes  a  flexible  param¬ 
eterized  architecture  for  performing  any  given  streaming  permutation  as  well  as  a  discussion  and 
evaluation  of  the  applicability  of  previous,  less  flexible  methods. 

Additionally,  this  thesis  includes  an  evaluation  of  designs  produced  using  the  proposed  gener¬ 
ation  framework  across  several  transforms  (such  as  the  discrete  Fourier  transform,  discrete  cosine 
transform,  and  others)  and  two  datatypes  (single  precision  floating  point  and  16  bit  fixed  point  are 
evaluated  here).  Results  are  obtained  by  synthesizing  for  FPGA  (Xilinx  Virtex-6)  and  ASIC  (tar¬ 
geting  a  commercial  65nm  standard  cell  library).  The  FPGA  results  compare  performance  with 
resources  required,  while  the  ASIC  results  compare  performance  with  power  and  area.  Results 
show  that  Spiral  is  able  to  generate  designs  across  a  range  of  cost/performance  tradeoffs. 

One  application  for  the  designs  produced  using  the  proposed  system  is  orthogonal  frequency- 
division  multiplexing  (OFDM)  for  optical  interconnects.  Optical  OFDM  has  recently  been  studied 
as  a  way  to  reduce  the  complexity  and  cost  of  optical  components  for  short  distance  links  (for  exam¬ 
ple,  in  data  centers)  [2]  or  to  improve  long-distance  transmission  performance  (telecommunications 
applications)  [3].  Optical  OFDM  depends  heavily  on  the  discrete  Fourier  transform  and  imposes 
a  high  throughput  requirement  on  it.  This  thesis  includes  a  study  of  real-time  implementations  of 
optical  OFDM,  using  hardware  cores  automatically  generated  by  the  proposed  framework. 

1.3  Contributions 

To  summarize,  the  contributions  of  this  thesis  include: 

•  a  formula-based  language  for  specifying  transform  algorithms  and  corresponding  sequential 
datapaths, 

•  an  automatic  compilation  tool  that  produces  synthesizable  register-transfer  level  Verilog, 

•  a  parameterized,  scalable  architecture  capable  of  performing  any  fixed  streaming  permutation, 

•  an  evaluation  of  the  designs  produced  using  this  system,  spanning  multiple  transforms,  data 
types,  and  platforms,  and 
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•  an  application  study  of  real-time  implementations  of  optical  OFDM  using  generated  designs. 


1.4  Organization 

This  thesis  is  organized  as  follows.  Chapter  2  describes  relevant  background  material  on  linear 
transforms,  their  algorithms,  and  algorithmic  specification  in  the  aforementioned  formula  language. 
Then,  Chapter  3  explains  the  proposed  mathematical  formula  language  for  describing  sequentially 
reused  datapaths.  Next,  Chapter  4  outlines  the  automatic  compilation  process  and  describes  each 
step  from  transform  to  formula  to  hardware  implementation.  Chapter  5  discusses  one  important 
compilation  problem:  implementing  permutations  on  streaming  data.  It  presents  a  proposed  flexible 
solution  for  any  fixed  streaming  permutation.  Chapter  6  examines  transform  algorithms  in  this 
context  and  illustrates  the  correspondence  between  algorithmic  and  datapath  structures.  Chapter  7 
evaluates  the  designs  produced  with  this  methodology  on  FPGA  and  ASIC.  Chapter  8  presents 
an  application  study  where  generated  cores  for  the  discrete  Fourier  transform  are  utilized  within 
real-time  systems  implementing  orthogonal  frequency-division  multiplexing  (OFDM)  for  optical 
networks.  Lastly,  Chapter  9  examines  related  work  and  Chapter  10  presents  concluding  remarks. 


Chapter  2 


Background:  Linear  Transforms  and 
Algorithms 


This  chapter  presents  background  material  on  linear  transforms  and  fast  algorithms  for  their  com¬ 
putation.  First,  Section  2.1  defines  linear  transforms  and  gives  relevant  examples.  Then,  Section  2.2 
shows  how  a  mathematical  formula  language  can  specify  transform  algorithms  and  how  formulas 
written  in  it  are  directly  translated  to  combinational  datapaths.  This  algorithm-specification  lan¬ 
guage  explicitly  captures  regularity  or  repetition  within  algorithms.  Later,  Chapter  3  will  show  how 
this  regularity  leads  to  hardware  structures  with  explicit  datapath  reuse.  Lastly,  Section  2.3  defines 
the  algorithms  considered  in  this  thesis. 

2.1  Linear  Transforms 

A  linear  transform  on  n  points  is  defined  by  a  dense  n  x  n  matrix.  Applying  the  transform  to 
an  n  point  input  vector  is  then  a  matrix-vector  multiplication.  For  example,  the  discrete  Fourier 
transform  on  n  points  is  defined  as  y  =  DFTn  x,  where  and  y  are  respectively  n  point  input  and 
output  vectors,  and 

DFTn  =  [w“]o<M<n,  Un  =  e"2™/n 

In  this  notation,  I) FT,,  is  an  n  x  n  matrix,  and  k  and  l  represent  the  row  and  column  of  a  given 
element  (respectively).  Thus,  computing  a  four-point  DFT  (with  input  vector  x  and  output  vector 
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y)  is  the  matrix- vector  product: 


I/O 

1111 

x0 

2/1 

1  i  —1—i 

X\ 

2/2 

1-1  1  -1 

X2 

2/3 

1  —  i  —  1  i 

X3 

Direct  computation  of  a  linear  transform  by  definition  requires  0(n2)  arithmetic  operations. 


2.1.1  Discrete  Fourier  Transform 

The  discrete  Fourier  transform  and  its  inverse  are  defined: 


Also  note  that 


DFT„  = 


t o, 


kH 


=  e~27Tj/n 


-  0 <k,£<n 


IDFTn 


(1/n) 


0<k,£<n 


DFTn  =  (DFTn)T. 


That  is,  the  DFT  matrix  is  symmetric. 


(2.1) 


(2.2) 


2.1.2  Two-Dimensional  Discrete  Fourier  Transform 

DFT  -2Dnxn  =  DFTn  0  DFTn,  (2.3) 


where  0  is  the  tensor  or  Kronecker  product ,  defined: 


B®  A  =  [bk/A\,  where  B  =  [bk,e\. 


2.1.3  Real  Discrete  Fourier  Transform 

When  the  DFT,  which  is  a  complex  transform,  is  performed  on  real  input  data  (length  n),  the  result 
is  an  n  point  complex  output  vector  with  conjugate-even  symmetry.  Thus,  half  of  its  output  data  is 
redundant.  The  real  discrete  Fourier  transform,  or  RDFT,  exploits  this  property  by  computing  only 
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the  non-redundant  outputs,  and  packing  them  into  a  real  n  point  output  vector.  Depending  on  the 
ordering  of  the  output  vector,  there  can  be  multiple  definitions  of  the  RDFT.  This  thesis  uses  the 
following  definition  for  RDFT: 


RDFTn 


1111...  1 
1-11-1  ...  -1 

cos^ 


sm 


2nk£ 


0<k<n/2 ,  0 <£<n 


(2.4) 


2.1.4  Discrete  Cosine  Transform  Type  2 


There  are  multiple  related  types  of  discrete  cosine  transforms  (DCTs)  and  discrete  sine  transforms 
(DSTs).  This  thesis  considers  the  most  common,  the  DCT  of  type  two,  defined: 


DCT-2 


cos 


k(2£+l)ir 
2  n 


0  <k,£<n 


The  DCT-2  is  easily  converted  to  other  related  transforms: 


(2.5) 


DCT-3,  =  (DCT-2,)"1  =  (DCT-2„)T 

1 

1 


DST-2,  = 


•  DCT-2,  • 


-1 


-1 


DST-3,  =  (DST-2,)"1  =  (DST-2,) 


The  algorithms  that  are  later  presented  for  DCT-2  can  easily  be  converted  to  DCT-3,  DST-2  and 
DST-3  in  this  way. 


2.2.  Formula  Representation  of  Transform  Algorithms 


8 


2.2  Formula  Representation  of  Transform  Algorithms 

Fast  transform  algorithms  enable  the  computation  of  an  n  point  transform  using  O  (n  log  n)  arith¬ 
metic  operations.  The  term  “fast  Fourier  transform”  refers  to  such  an  algorithm  for  computing  the 
discrete  Fourier  transform.  A  fast  transform  algorithm  can  be  expressed  as  a  decomposition  of  the 
n  x  n  transform  matrix  into  a  product  of  structured  sparse  matrices.  For  example, 


1 

0 

1 

1 

0 

1 

0 

0 

1 

0 

1 

1 

0 

1 

0 

1 

0 

0 

1 - 

0 

0 

1 

0 

1 

0 

1 

0 

0 

1 

-1 

0 

0 

0 

0 

1 

0 

1 

0 

-1 

0 

0 

0 

1 

0 

0 

0 

1 

1 

0 

1 

0 

0 

1 

0 

1 

0 

-1 

1 

0 

0 

0 

i 

1 

0 

0 

1 

-1 

1 

0 

0 

0 

1 

Thus,  computing  y  =  DFT4  x  is  equivalent  to  multiplying  the  data  vector  by  each  matrix  from  right 
to  left.  A  fast  algorithm  decomposes  annxn  transform  matrix  into  a  product  of  sparse  matrices 
(O  (log  n)  many),  each  which  requires  O  (n)  operations.  Thus,  multiplying  the  input  vector  by  each 
of  these  sparse  matrices  requires  O(nlogn)  operations. 

The  Kronecker  product  formalism  developed  in  [1]  uses  linear  algebra  concepts  to  mathemat¬ 
ically  represent  fast  transform  algorithms  by  capturing  the  structure  within  the  sparse  matrices  of 
the  decomposition.  This  formalism  can  serve  as  the  basis  for  a  language  that  represents  transform 
algorithms  as  formulas,  where  each  term  in  the  formula  has  a  corresponding  dataflow  interpreta¬ 
tion.  Thus,  a  formula  in  this  language  can  be  directly  translated  into  a  combinational  hardware 
implementation. 

In  Backus-Naur  form,  this  language  is  defined  as  follows: 

matrixn::=matrixn  •  •  •  matrixn 
fT  matrixn 

j  1^  <8>  matrixm  where  n  =  km 

|  Ik  matrixm  where  n  =  km 

|  basen 

basen  ::=Dn  =  diag  (d0, dn- 1)  \Pn\An 

This  language  is  a  subset  of  the  signal  processing  language  (SPL)  used  in  Spiral,  a  program 
generator  for  software  implementations  of  linear  transforms  [4,  5]. 
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B, 


Ar. 


y 


x0- 

Xi  ■ 

x2- 

X3. 


A: 


A: ; 


(a)  An  ■  Bn 


(b)  /2  <S)  A2 


xo  • - •  yo 

xi  yi 
•><*y2 

-•  y3 


X2 
X3  • 


Xo 
X 1 


yo 

Xo 

• — (ho)  • 

yo 

yi 

Xj 

• — (dj) — • 

yi 

y2 

X2 

•”■(^2) — • 

y2 

y3 

X3 

• — (dj) — • 

y3 

(c)  D4 

rO- 

-•yo 

-•yi 

(d)  Pi  =  Li 


(e)  A2  =  DFT2 


Xo  * 

Xi 

X2 

x3  » 


dft2 

dft2 

dft2 

dft2 

•  yo 

yi 


(f)  DFT4  =  L\  ■  (J2  ®  DFT2)  •  L\  ■  r24  •  (J2  ®  DFT2)  •  L\ 

Figure  2.1:  Examples  of  formula  elements  and  their  corresponding  combinational  datapaths. 

Next,  the  following  subsections  explain  each  portion  of  the  language,  and  illustrate  the  combi¬ 
national  hardware  interpretation  of  each  term  (in  Figure  2.1). 


2.2.1  Matrix  product 

A  matrix  formula  can  be  decomposed  into  a  product  (line  1)  or  iterative  product  (line  2)  of  matrix 
formulas.  Figure  2.1(a)  illustrates:  if  y  =  ( AnBn )  ■  x,  then  input  vector  x  first  is  transformed  by 
Bn,  then  by  An  to  produce  output  vector  y.  Note  that  the  matrices  are  applied  to  the  data  vector 
from  right  to  left. 

The  iterative  product  is  important  because  it  allows  the  explicit  specification  of  repeated 
stages  that  are  identical  or  related.  Chapter  3.2  discusses  how  this  repetition  is  exploited  to  represent 
explicit  datapath  reuse  in  the  proposed  hardware  compilation  and  optimization  framework. 
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2.2.2  Tensor  product 

Line  3  shows  that  a  matrix  formula  can  include  the  tensor  (or  Kronecker)  product  of  matrices.  The 
<g>  operator  is  the  tensor  product  operator,  defined: 


B®A  =  [ bk/A ],  where  B  =  [bkj\ . 


Ik  is  the  k  x  k  identity  matrix,  so 


Ik  0  A  rn  — 


Ay ] 


A, 


A 


m 


a  km  x  km  matrix  that  is  block-diagonal.  When  Ik  0  Am  is  interpreted  as  dataflow,  the  Arn  matrix 
is  applied  k  times  in  parallel  to  consecutive  regions  of  the  /{•///-length  input  vector.  Figure  2.1(b) 
illustrates  this  for  k  =  2,  m  =  2. 

Line  4  illustrates  an  indexed  version  of  the  tensor  product: 


A 


m 

0 


Ik  A™  = 


A? 


A 


m 

k- 1. 


where  the  index  variable  I  parameterizes  the  A  matrix. 

Although  this  language  only  supports  the  tensor  product  where  the  left  term  is  the  identity 
matrix,  simple  identities  rewrite  some  other  structures  into  this  form.  For  example: 


Am  0  Ik  =  Lk™(Ik  ®  Am)Lkm, 


(2.7) 
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where  L  is  the  stride  permutation  matrix  defined  in  (2.8),  below.  Further, 

Bfc  0  Am  =  (Bk  0  Im)(Ik  0  Am ) 

=  Lk^(Im®Bk)Lk^(Ik®Am). 


The  regularity  expressed  by  Ik  0  Am  is  critical  to  the  hardware  generation  and  optimization 
method  proposed  in  this  work.  Section  3.1  shows  how  the  regularity  represented  by  this  formula  is 
exploited  to  represent  explicit  datapath  reuse. 


2.2.3  Diagonal  matrices 

This  language  contains  three  classes  of  base  matrices.  First  are  diagonal  matrices,  written  Dn.  A 
diagonal  matrix  is  one  that  contains  non-zero  elements  only  along  its  main  diagonal: 


Dn  =  diag  (d0,  di, . . . ,  dn- 1)  = 


do 


di 


dn— 1 


where  the  dp  are  constants.  Multiplying  a  vector  by  Dn  results  in  scaling  each  element  by  one 
constant.  Figure  2.1(c)  illustrates  this  for  Dp .  This  thesis  uses  Dn  to  represent  any  generic  diagonal 
matrix  of  size  n.  A  diagonal  matrix  can  be  defined  with  a  parameter,  for  example  D™,  where  t  can 
be  the  index  variable  of  an  indexed  tensor  product  or  iterative  product. 


2.2.4  Permutations 

The  second  class  of  base  matrix  is  comprised  of  permutations,  or  fixed  reorderings  of  data  elements. 
Pn  denotes  a  permutation  on  n  points.  As  combinational  logic,  this  is  implemented  as  a  re-ordering 
of  data  by  shuffling  wires.  Figure  2.1(d)  illustrates  this  for  a  particular  permutation  on  four  points. 
This  thesis  uses  Pn  to  represent  an  arbitrary  n  point  permutation;  other  letters  will  be  used  to 
define  other  permutations  in  transform  algorithms.  A  permutation  can  be  viewed  as  a  matrix  or  as  a 
mapping  on  the  indices  of  data  elements.  For  example,  L”  represents  the  stride -by-m  permutation 
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on  n  points,  which  permutes  data  according  to 


L ^  :  i  •  ( n/m )  +  j  i— >  j  ■  m  +  i,  0  <  i  <  m,  0  <  j  <  n/m,  (2.8) 


where  in  +  j  represents  the  given  output  index,  and  jm  +  i  is  the  corresponding  input  index.  For 
example, 

1  0  0  0  0  0  0  0 

0  0  1  0  0  0  0  0 

0  0  0  0  1  0  0  0 

0  0  0  0  0  0  1  0 

0  1  0  0  0  0  0  0 

0  0  0  1  0  0  0  0 

0  0  0  0  0  1  0  0 

0  0  0  0  0  0  0  1 

Another  important  permutation  is  the  base-?’  digit  reversal  permutation  on  n  points:  /(" . 


L%  = 


t- 1 


r;  =  Yl(irt-i-i  ®  l: 


i+l. 


(2.9) 


1=0 


When  r  =  2,  this  permutation  is  called  the  bit  reversal  permutation. 


2.2.5  Computational  kernels 

The  hnal  class  of  base  matrices  is  comprised  of  computational  kernels  Matrix  An  denotes  a  generic 
n  x  n  computational  kernel  that  takes  in  n  data  elements  and  produces  n  output  elements.  Typi¬ 
cally,  such  a  kernel  is  only  used  when  n  is  small;  a  combinational  datapath  is  formed  by  directly 
implementing  a  matrix-vector  multiplication.  One  example  of  such  a  kernel  is 


DFT2 


1  1 
1  -1 


which  is  illustrated  in  Figure  2.1(e).  Computational  kernels  can  also  be  parameterized  by  index 
variables:  A1}. 
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2.2.6  Example 

A  linear  transform  algorithm  specified  in  this  language  can  be  directly  mapped  to  a  combinational 
datapath.  For  example,  the  Cooley-Tukey  FFT  (fast  Fourier  transform)  [6]  can  be  written  as 

DFTmn  =  LZn  •  (In  ®  DFTm)  •  L"m  •  T™n  •  (Im  ®  DFTn)  •  L™n,  (2.10) 

where  L"Jn  is  the  stride  permutation  (2.8),  T™'  is  a  diagonal  matrix  of  “twiddle  factors”  (as  speci¬ 
fied  in  [7]): 

771—1  /  n—  1  \ 

IT”=  ©  ffi“™  •  ».  =  e-2’i/n,  <2. 1 1) 

k= 0  \£=0  / 

and  0  is  the  direct  sum  operator.  So, 

T24  =  ©  {' 0  “4  )  =  ®  (w4'°  ©  w44)  =  w4  °  ©  w4  1  ©  ^4  °  ©  W4'1 

fc=0  \^=0  /  fc=0 

=  diag(l,  1, 1,  i). 

Figure  2.1(f)  shows  the  corresponding  combinational  datapath  for  this  algorithm  when  n  =  2  and 
m  =  2.  Again,  note  that  the  matrices  are  applied  from  left  to  right. 

2.3  Transform  Algorithms 

This  section  uses  the  mathematical  language  defined  in  Section  2.2  to  specify  the  transform  algo¬ 
rithms  considered  in  this  thesis.  Later,  Chapter  6  will  explain  how  these  algorithms  fit  within  the 
proposed  hardware/algorithm  specification  language,  and  Chapter  7  will  evaluate  implementations 
of  the  algorithms. 

2.3.1  Discrete  Fourier  Transform 

O(nlogn)  algorithms  for  the  DFT  are  often  called  fast  Fourier  transforms,  or  FFTs.  Note  that  the 
DFT  is  symmetric,  so  alternate  versions  of  these  algorithms  may  be  obtained  by  transposing  the 
final  formula,  using  property  (AB)T  =  BJ AJ .  Additionally,  these  algorithms  can  be  converted 
into  algorithms  for  the  IDFT  by  taking  the  complex  conjugate  and  scaling  as  defined  in  (2.2). 
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Cooley-Tukey  FFT.  The  Cooley-Tukey  FFT  (fast  Fourier  transform)  algorithm  [6]  recursively 
decomposes  a  DFT  of  size  mn  into  DFTs  of  size  m  and  n.  It  is  given  by 

DFTmn  =  LZn  •  (In  ®  DFTm)  •  L"m  •  T?n  ■  (Im  ®  DFTn)  •  L™,  (2-12) 

where  L  and  T  are  as  defined  previously. 

Pease  FFT.  The  Pease  FFT  [8]  is  an  iterative  algorithm  for  the  DFT  that  can  be  derived  from 
the  Cooley-Tukey  FFT  (as  shown  in  [9]).  It  is  given  by 

DFTrt  =  (  IJ  Lr  ‘  (/®-1  0  DFTV)  '  °e  )  '  Rr\  (2-13) 

Wo  / 

where  C  is  a  diagonal  matrix  of  twiddle  factors: 


ci  =  ■  (/r 


rpr*-1 


L 


where  L,  R  and  T  are  as  defined  previously. 

Here,  the  DFT  of  size  r*  is  computed  by  first  reordering  data,  then  passing  through  t  stages, 
each  which  scales  data,  performs  parallel  DFTr  computations,  and  then  reorders  the  data  with  L'l . 

Parameter  r  is  called  the  radix.  Higher  radices  give  fewer  stages,  but  each  stage  is  more  com¬ 
plicated.  As  the  radix  increases,  the  arithmetic  cost  of  the  algorithm  decreases,  because  fewer 
non-trivial  multiplications  are  performed  (although  the  improvement  diminishes  as  r  increases). 

A  key  characteristic  of  the  Pease  FFT  is  that  the  data  vector  is  accessed  in  the  same  order  in 
each  of  the  t  iterations.  In  the  formula,  this  can  be  seen  in  the  fact  that  only  the  diagonal  matrix  C 
depends  on  iteration  index  I.  Thus,  the  only  change  from  stage  to  stage  is  in  the  constants  the  data 
vector  is  scaled  by. 

Figure  2.2  illustrates  the  dataflow  for  the  Pease  algorithm  for  DFT23: 

(n^2-(/4®DFT2)-C|J  Rl 

\r=o  / 

The  formula  is  read  from  right  to  left,  but  the  dataflow  is  drawn  from  left  to  right.  So,  Figure  2.2 
takes  in  input  vector  x  on  the  left,  passes  it  through  permutation  fi2,  then  through  three  stages 
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►  y 


Rl  Cl  h  ®  dft2  l\  cf  h  ®  DFTa  L\  Cl  h  ®  dft2  l\ 
Figure  2.2:  Pease  FFT  dataflow  for  DFT23. 

(indicated  with  shaded  background).  Each  stages  performs  a  scaling  by  a  diagonal  matrix  C,  four 
parallel  DFT2  computations,  and  permutation  L2.  Note  that  multiplications  by  1  are  omitted  from 
the  diagram,  and  that  Cq  =  1$. 

Iterative  Cooley-Tukey  FFT.  Similar  to  the  Pease  FFT,  the  Iterative  Cooley-Tukey  algorithm 
can  also  be  derived  from  (2.12),  but  it  results  in  a  different  structure. 


a- 1 


DFTrt  =  I <g>  DFTr  •  (. Iri  (2)  T; 


pr1 

Xtr  , 


(2.14) 


e=o 


where  all  matrices  are  as  previously  dehned.  By  utilizing  the  property  given  in  (2.7)  and  restructur¬ 
ing  terms,  this  algorithm  can  be  expressed 


a- 1 


DFTrt  =  L;'  (  Yl  (4*-i  2  DFTr)  (lrt  2  (e\ 


\i= 0 


y,t—£  -nt-i 


pr* 

lbr  5 


(2.15) 


where  E  is  a  diagonal  matrix  of  twiddle  factors: 
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and  Q  is  a  product  of  two  stride  permutations: 


This  thesis  will  utilize  the  restructured  algorithm  (2.15),  and  refer  to  it  as  the  Iterative  FFT.  This 
algorithm  is  similar  to  the  Pease  FFT:  both  compute  DFTrt  using  t  stages,  each  stage  consisting 
of  a  permutation,  a  scaling  by  constants,  and  parallel  DFTr  computations.  However,  there  is  one 
important  difference:  in  each  iteration,  the  Pease  algorithm  performs  a  permutation  on  r*  points 
and  multiplies  the  vector  by  an  rf  point  diagonal  matrix.  In  contrast,  the  permutation  and  diagonal 
matrix  in  this  algorithm  grow  smaller  as  the  stages  progress.  This  means  in  stage  l,  the  algorithm 
performs  rl  parallel  permutations  on  r1'^1  points.  Chapter  6  will  show  how  these  differences  lead  to 
a  different  type  of  hardware  implementations  than  are  obtained  from  the  Pease  algorithm. 

Figure  2.3  illustrates  the  dataflow  for  the  Iterative  FFT  algorithm  for  DFT23: 

DFT23  =  L\  ( I]  (h  ®  DFT2)  (/*  ®  ( Ef  ~l  ■  Qf "')) )  Rl 
Vr=o  / 

Again,  the  formula  is  applied  from  right  to  left,  while  the  diagram  takes  its  input  x  on  the  left  side 
and  produces  its  output  y  on  the  right.  First,  the  data  vector  is  permuted  by  fZ2.  Then,  it  goes 
through  three  stages,  separated  with  shaded  boxes.  Note  that  the  first  stage  consists  only  of  one 
column  of  1 4  (g>  DFT2,  because  Ef  and  Q\  are  both  equal  to  /2  and  thus  do  not  require  hardware. 

Mixed-radix  FFT.  The  mixed-radix  algorithm  (based  on  (2.12))  breaks  down  a  DFT  of  size 
rks£  into  multiple  DFTs  of  sizes  rk  and  s 


DFTrrs£ 


=  L 


r^S<- 

k 


(Is 


>  DFTrfe 


k  f  k  f 

.  Lr.s  ■  T»s 


(Ir 


DFT. 


\  rr^sc 
t )  *  Lrk 


(2.16) 


Then,  the  DFTri;  and  DFTS<  matrices  are  decomposed  using  radix  r  and  s  algorithms  such  as  (2. 13) 
or  (2.15). 

Bluestein.  The  Bluestein  FFT  [10]  is  a  convolution-based  algorithm  for  any  problem  size  n.  It 
is  typically  used  when  the  problem  size  makes  it  difficult  or  impossible  to  perform  with  the  other 
algorithms.  The  Bluestein  algorithm  performs  the  DFT  by  scaling  the  input  vector  and  convolving 
it  with  pre-computed  coefficients.  The  convolution  of  two  n  length  signals  can  be  performed  as 
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x - ►  y 


R-2  I4,  0  DFT2  I2  0  Q  j  /2  0  E{  I4.  0  DFT2  Qq  Eq  I4  0  DFT2  l\ 


Figure  2.3:  Iterative  FFT  dataflow  for  DFT23. 

point- wise  multiplication  of  length  m  >  2n  —  1  in  the  frequency  domain.  This  allows  a  DFT  of  any 
given  size  to  be  computed  using  DFTs  of  two-power  size,  at  the  expense  of  additional  operations. 
This  algorithm  is  given  by: 


_  J-)  (2) 

lyr  j-n  —  ^ nxm 


IDFTm  -DX  .  DFT™  -D("i 


(2.17) 


where  rn  =  2  (r/)l  + 1  js  the  smallest  power  of  two  greater  than  2n  —  1,  and  the  D  matrices 

are  diagonal  matrices  that  scale  the  vector  by  constant  values.  D^xn  and  D^xm  are  rectangular 

matrices,  so  in  addition  to  scaling  the  data,  D^xn  extends  the  input  data  vector  from  n  points  to  m 

(2) 

points  (by  zero  padding),  and  Dnxm  shortens  the  output  data  vector  from  m  points  to  n  points  (by 
discarding  unneeded  data). 
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2.3.2  Two-Dimensional  Discrete  Fourier  Transform 

Using  the  definition  of  the  two-dimensional  DFT  and  a  property  of  the  tensor  product,  the  row- 
column  algorithm  can  be  derived: 


DFT-2Dnxn  =  (4  <g>  DFTn)  •  L*  ■  (4  ®  DFTn)  •  l£ 


l 


(2.18) 


Then,  one-dimensional  DFT  algorithms  such  as  (2.13),  (2.15),  or  (2.16)  can  be  used  to  compute  the 
DFTn. 


Other  algorithms  for  computing  multi-dimensional  DFT  not  considered  in  this  thesis  include 
the  vector-radix  algorithm  [11]  and  the  Dimensionless  FFT  algorithm  [12]. 

2.3.3  Real  DFT 

Complex  half-size  Real  DFT.  An  RDFT  of  size  n  can  be  computed  using  a  complex  DFT  of  size 
n/2  and  a  post-processing  step,  as  shown  in  [13]: 


RDFTn  =  ((K'2n  ®  I2)  •  (/n/4  %  A^l))  •  (( K t?)'1  0  J2))  •  DFTn/2 


(2.19) 
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where 


A4(0)  = 


110  0 

1-10  0 

0  0  10 

0  0  0  1 


A,(£  /  0)  =  diag(l,  1, 1,  -1)  •  (F2  0  4) 


c  =  cos(27t  •  £/n)/ 2,  s  =  sin(2-7r  •  £/n)/ 2, 

1  1 


1/2 

1 

0 

1 

0 

1/2 

•  r4 . 

L  2 

0 

1 

0 

1 

c  —  s 

0 

1 

0 

-1 

s  c 

-1 

0 

1 

0 

F2  = 


1  -1 


K'^m  =  (4+1  ©  4-1  ©  4+i  ©  4_i . . .  )4T, 


km 


Jn  is  the  n  x  n  identity  matrix  with  columns  reversed,  and  DFTn/2  converts  ann/2xn/2  complex 
matrix  into  an  x  n  real  matrix  by  replacing  each  complex  element  a  +  bi  with 

a  —b 
b  a 

This  algorithm  works  by  packing  the  n  real  data  as  n/2  complex  data,  performing  a  complex 
DFTn/2  using  the  DFT  algorithms  presented  above  in  Chapter  2.3.1.  Then  it  permutes  data  by 
(. K~l  0  4),  performs  one  column  of  computational  blocks  (A),  and  finally  permutes  the  data  again 
to  output  data. 

One  way  to  reduce  the  cost  of  this  implementation  is  to  merge  the  rightmost  permutation  stage 
(A'-1  0  4)  with  the  output  permutation  on  the  DFT  algorithm.  For  example,  the  iterative  Cooley- 
Tukey  algorithm  (2.15)  contains  a  stride  permutation  L  on  its  output  side,  which  can  be  merged 
directly.  Or,  the  Pease  algorithm  (2.13)  can  be  transposed,  moving  the  R  permutation  to  the  output 
side  (noting  that  R™  =  (A”)T). 


2.3.  Transform  Algorithms 


20 


Although  this  algorithm  has  higher  arithmetic  cost  over  the  native  RDFT  algorithms  presented 
next,  it  is  useful  because  all  optimizations  for  the  DFT  can  be  immediately  applied. 

Native  RDFT  Algorithm:  Cooley-Tukey-type.  Based  on  the  framework  in  [13],  three  algo¬ 
rithms  for  the  RDFT  can  be  defined,  similar  in  structure  to  the  Cooley-Tukey,  Pease,  and  Iterative 
Cooley-Tukey  FFTs  (as  shown  in  (2.12),  (2.13),  and  (2.15),  respectively).  All  three  algorithms  use 
a  helper  transform  rDFT(it),  where 

RDFTn  =  (F'2  ©  In- 2)  •  rDFTn(0). 

First,  a  recursive  algorithm  is  defined  that  will  be  used  as  the  basis  for  two  iterative  algorithms. 

rDFT 2m(rk(u,j))  |  (rDFT2fc(tt)  <g>  Im)  (2.20) 


(k- 1 

© 

3= 0 


rk(u,j) 


r^2km 

Kf2km 
m 


j/2k,  u  =  0, 

*  (u  +  Li/2_|)A,  w  /  0  and  j  even, 
(1  —  u  +  [j /2\)/k,  rt/Oandjodd 

((4  ©  Jk  ©  4  ©  Jk  ■  ■  ■  )iim)  ®  h, 

(( 4  ©  Zk  ©  4  ®Zk...  )L\ <g>  4, 


Zk  =  (4  ©  Jk- 1)) 


(A\B)i 


A ,  i  0 

B ,  i  =  0 
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where  J  is  I  with  the  column  order  reversed.  The  base  cases  are  given  by: 
rDFT4(0)  =  (F2  <g>  I2) , 


1 

1 

rDFT^it  /  0)  = 

1 

•  rDFT4(0)  ■ 

1 

1 

c  —  s 

-1 

s  c 

c  =  cos(7 ru),  s  =  sin(7ru). 


Native  RDFT  Algorithm:  Constant  Geometry.  By  left-expanding  and  applying  identities  to 
algorithm  (2.20),  the  following  constant  geometry  RDFT  algorithm  can  be  derived.  Let  the  problem 
size  be  2 km,  where  m  =  kp,  p  >  1.  Let  q  =  log/.,  m.  The  algorithm’s  radix  is  2k. 


rDFT2fcm(0)  =  V}: 


2km 


^rDFT2fc(Sfci9(0,0,/fcim(i,£)))L^m  L 


-  2  km 
km  1 


(2.21) 


where 


fk,m(iJ)  = 

mod  m/kl , 

Sk,q(d,U,f)  =  < 

u, 

d  =  q 

rk(Sk,q{d  +  L  U,  /),  [ (/ 

mod  kd+1)/kd J), 

d^q 

v,2km  = 

r  k,m  \ 

/logfe  m—  1 

n  (kr/r'M 
(  ^=0 

>-1 

)  J  ®/2 

The  algorithm  begins  with  a  stride  permutation,  followed  by  log/.,  m  +  1  stages,  each  with  a  permu¬ 
tation  and  a  radix  2k  computational  basic  block.  This  block  (rDFT2fc  (...))  is  computed  using  the 
base  case  above  when  k  =  2,  and  using  (2.20)  otherwise.  The  structure  of  this  algorithm  is  called 
constant  geometry  because  the  permutation  L2^1  is  the  same  in  each  stage. 

Native  RDFT  Algorithm:  Iterative.  Also  derived  from  (2.20)  is  the  following  iterative  algo¬ 
rithm.  Let  the  problem  size  be  2km,  where  m  =  kp,p  >  1.  Let  q  =  log/,  m.  Again,  the  algorithm’s 
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radix  is  2k. 


(2.22) 


•  (lm  rDFT2fc(sfc)g_j(0, 0,  [£/&})))  ■  L%m 


where  IT  is  a  product  of  stride  permutations: 


and  V  and  .s()  are  as  defined  above. 

This  algorithm  is  similar  to  the  constant  geometry  RDFT  algorithm  (2.21).  However,  un¬ 
like  (2.21),  this  algorithm’s  permutation  W  grows  more  local  as  the  algorithm  is  performed  (that 
is,  as  i  decreases).  So,  its  different  stages  have  different  geometry,  but  the  interconnections  do  not 
need  to  span  the  entire  data  vector  in  all  stages.  The  relationship  between  this  algorithm  and  the 
constant  geometry  RDFT  algorithm  is  similar  to  the  relationship  between  the  Iterative  and  Pease 
FFT  algorithms  (2.15)  and  (2.13). 

2.3.4  Discrete  Cosine  Transform 

An  algorithm  for  the  DCT  of  type  2  that  is  well-suited  for  hardware  implementation  is  given  in  [14]. 


(2.23) 


Matrices  U  and  PH  represent  permutations,  and  S  represents  one  stage  of  computation,  consisting 
of  a  4  point  basic  block  and  a  diagonal  matrix.  The  full  specification  of  this  algorithm  is  given 
below. 

The  structure  of  this  algorithm  is  similar  to  the  FFT  algorithm  given  in  (2.15).  It  computes  a 


transform  of  size  n  in  log2(n)  stages,  each  of  which  reorders  data  and  performs  a  small  computa¬ 
tional  kernel.  The  added  complexity  comes  from  the  M  and  H  matrices,  which  perform  an  extra 
subtraction  and  data  reordering  in  some  basic  blocks. 


23 


Chapter  2.  Background:  Linear  Transforms  and  Algorithms 


Using  the  properties  of  the  DCT  and  DST  given  in  Chapter  2.1.4,  this  algorithm  can  easily  be 
used  to  produce  implementations  of  DCT-3,  DST-2,  and  DST-3.  Other  similar  algorithms  can  be 
found  in  [15]  and  [16]. 

Unlike  the  other  algorithms  examined  in  this  thesis,  this  DCT  algorithm  is  defined  only  for  a 
fixed  radix  size  (two). 

Full  specification  of  algorithm  (2.23).  Matrix  S  represents  one  stage  of  computation,  parame¬ 
terized  by  stage  s: 

=  M-2  ’  ))  ■  diag(pfc(£,  s))  ■  (l2k-2  H \  •  (I2 *-1  <8>  F2), 


where 


1  0 

-Hs{£)  1 


0,  £  mod  2s  =  0, 

1,  t  mod  2s  ^  0 


gk(£,s)  =  (2^l2\)d{2k-s~1  +  [£/2s+1\))fk^ 


d(£)  =  cos {(hK(£  -  K)  +  l/2)n/2K),  K  =  2^°^ 
hi(0)  =  0,  /i2Af(2f)  =  h]\r(£),  /i2Ar(2f  +  1)  =  2N  —  1  —  /ijv(£) 

fk{£,  s)  =  {£  mod  2)  +  (1  -  r0(f))(l  -  Tk-i(s)) 


Ti(s)  =  < 


0,  S  = 

1,  s  / 

10  0  0 
0  0  0  1 
0  0  10 
0  10  0 


-1  /is_ l(f) 


F2 


1  1 
1  -1 
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Permutations  U  and  PH  are: 


k- 2 

=  ^  2— 2fc— ^  ^  (-^2  ®  ^2)))  '  (-^2^  ®  £— 1), 

£=0 

P^l  =  [n  =  /l2fe(m)]m,n=0,l,...,2fc-l> 


where  h2k  (m)  is  as  defined  above,  and  J  is  the  identity  matrix  with  columns  reversed. 


Chapter  3 


Formula-Based  Datapath 
Representation 


The  language  described  in  the  previous  section  can  represent  a  wide  range  of  algorithms,  but  it 
does  not  represent  sequential  reuse  of  datapath  components,  where  one  computational  block  is  used 
multiple  times  while  computing  a  single  problem.  Sequential  reuse  is  necessary  since  the  combi¬ 
national  datapaths  realized  in  Chapter  2  are  often  too  large  for  practical  implementation  for  all  but 
the  smallest  transform  sizes.  This  chapter  describes  extensions  to  this  formula  language  to  repre¬ 
sent  two  types  of  sequential  reuse  that  are  relevant  for  hardware  designs.  The  result  is  a  hardware 
language  that  allows  explicit  datapath  description  at  the  formula  level.  Later,  this  thesis  shows  that 
this  extended  language  drives  the  proposed  compilation  system,  and  that  it  allows  description  and 
generation  of  a  wide  tradeoff  space. 

The  work  described  in  this  section  was  presented  in  part  in  [17]. 


3.1  Streaming  Reuse 

Streaming  reuse  restructures  a  datapath  with  parallel  computation  blocks  into  a  smaller  datapath 
where  data  stream  through  the  system  over  multiple  cycles. 

As  shown  in  Section  2.2,  the  tensor  product  Im  0  An  results  in  rn  data-parallel  instantiations 
of  the  block  An  (Figure  3.1(a)).  However,  other  structures  can  also  perform  the  same  computa¬ 
tion.  For  example,  the  tensor  product  can  be  interpreted  as  reuse  in  time  (rather  than  parallelism  in 
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size 
(in  x  n  ) 
vector 


Hf 

"{[■ 

l-{| 


A„ 


A„ 


A„ 


(a)  No  streaming  reuse:  Im  ®  An . 


one  streamed  vector, 
size  (m  x  «) 


per  i 


m  cycles 

(b)  Full  streaming  reuse:  Im  <g>sr  A, 


w  words 
per  cycle 


{I 


A  n 


a 

o 
o 

?r 

S- 


mn/w  cycles 

(c)  Partial  streaming  reuse:  Imn/W  ®>sr  (lw/n  ®  An). 


Figure  3.1 :  Examples  of  streaming  reuse. 


space).  Then,  one  can  build  a  single  instance  of  block  An  and  reuse  it  over  m  consecutive  cycles 
(Figure  3.1(b)).  Rather  than  all  mn  input  points  entering  the  system  concurrently,  they  now  stream 
in  and  out  at  a  rate  of  n  words  per  cycle.  This  is  defined  as  streaming  reuse  and  represented  as 
Im  ®sr  An.  The  streaming  width  indicates  the  number  of  inputs  (or  outputs)  that  enter  (or  exit)  a 
section  of  datapath  during  each  cycle,  ffere,  the  streaming  width  is  n. 

The  two  interpretations  of  ®  can  be  nested  in  order  to  build  a  partially  parallel  datapath  that  is 
reused  over  multiple  cycles  (Figure  3.1(c)).  In  general,  Im®An  can  be  written  as  Imn/w (Iw/n ® 
An),  which  results  in  a  datapath  with  a  streaming  width  of  w,  consisting  of  w/n  parallel  instances 
of  An,  reused  over  mn/w  cycles  (w  is  a  multiple  of  n\w<  mn).  Increasing  the  streaming  width 
increases  the  datapath’s  cost  and  throughput  proportionally. 


3.1.1  Streaming  reuse  of  other  formula  constructs 

Indexed  tensor  product.  Streaming  reuse  can  also  be  applied  to  the  indexed  tensor  product 


Im.  Ag  , 
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where  A  is  an  n  x  n  matrix  parameterized  by  t.  When  this  construct  is  streamed  with  width  w  =  n, 
one  computational  block  is  built  that  is  capable  of  performing  all  A™,  0  <  l  <  m.  In  the  worst  case, 
this  can  lead  to  a  roughly  m  x  overhead  in  the  hardware  cost,  if  the  m  instances  cannot  share  logic 
between  them.  However,  often  this  hardware  block  can  be  simplified.  For  example,  the  DCT-2 
algorithm  (2.23)  presented  in  Chapter  2.3.4  contains  the  term 

(/2*-i  ®<^), 


where 


1  0 

-MsW  1 


Ps{t) 


0,  i  mod  2s  =  0, 
1,  l  mod  2s  /  0 


When  streaming  reuse  is  applied  to  this  formula,  only  two  possible  instances  of  M  need  to  be 
considered:  one  that  is  equivalent  to  I2,  and  one  that  performs  one  subtraction.  In  hardware,  this  can 
be  realized  as  one  subtractor  and  one  multiplexer,  with  a  comparator  controlling  the  multiplexing 
logic. 

Diagonal  matrices.  Diagonal  matrices  scale  each  data  element  by  a  constant.  An  n  point 
diagonal  can  easily  be  streamed  with  width  w  by  building  w  multipliers,  each  holding  its  own 
lookup  table  of  n/w  constants. 

The  simplest  way  to  write  this  as  a  streaming  reuse  formula  is  to  reformulate  the  expression  into 
one  that  uses  /0£.  So,  a  diagonal  Dn  with  streaming  width  w  is 


StreamDiag(-Dn,  w)  =  (ln/w  ((Iw  (Dn  o  f(k)))  o  g(£)))  ,  (3.1) 


f{k)  =  ln/w  ®  (k)w, 

g{()  —  Lnjw  o  ((£)n/w  0  iw)  1 


where 
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and  i,  and  o  are  indexing  functions  as  defined  in  [18]: 

•  I71  *  In?  1  1  *  ^? 

(. j)n  ■  II  Ini  *  ^  j, 

go  f  :  Im  ->  Ijv;  i  g(f(i)), 

In  =  {0, . . .  ,n  -  1}. 

First,  the  diagonal  matrix  is  split  into  w  partial  lookup  tables.  Each  holds  the  values  that  will 
be  multiplied  by  inputs  from  one  of  the  w  input  ports.  The  rightmost  term  in  the  formula  serves 
to  separate  the  n  point  diagonal  into  n/w  segments  of  size  w  (each  corresponding  to  one  cycle 
of  streaming  reuse).  The  values  of  these  segments  can  then  be  computed  and  the  results  stored  in 
lookup  tables. 

This  representation  also  allows  easy  simplification  in  the  case  of  certain  repeated  patterns  within 
the  diagonal  matrix.  For  example,  every  other  element  of  twiddle  diagonal  T%  is  equal  to  1.  So, 
when  this  diagonal  is  streamed  (with  2  power  streaming  width),  half  of  the  corresponding  multipliers 
will  always  multiply  by  1.  Using  this  representation  makes  this  situation  (and  similar  ones)  easy  to 
recognize  within  the  compiler,  allowing  automatic  removal  of  the  unneeded  multiplier. 

Permutation  matrices.  Implementing  streaming  reuse  on  permutation  structures  is  a  difficult 
problem,  but  one  that  is  crucially  important.  A  permutation  streaming  with  width  w  must  take  in 
an  n  point  input  vector  at  a  rate  of  w  words  per  cycle.  It  must  buffer  the  data  stream,  perform  a 
reordering  over  the  entire  n  data  points,  and  then  stream  the  data  vector  out.  Chapter  5  presents  two 
methods  for  constructing  this  type  of  system.  This  thesis  uses  a  streaming  permutation  wrapper 

StreamPerm(Pn,  w) 


as  a  way  of  representing  this  structure. 
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- v - '  1  block,  reused 

m  blocks  m  times 


(a)  No  iterative  reuse:  An. 


(b)  Full  iterative  reuse:  An. 


d  blocks,  reused 
m/d  times 

(c)  Partial  iterative  reuse:  Yim/d  did  An). 

Figure  3.2:  Examples  of  iterative  reuse. 

3.2  Iterative  Reuse 

The  product  of  m  identical  blocks  An  can  be  written  as  Wm  An .  A  straightforward  interpretation  of 
this  is  a  series  of  m  blocks  cascaded  (Figure  3.2(a)). 

The  same  computation  can  be  performed  by  reusing  the  An  block  m  times  (Figure  3.2(b)).  Now, 
the  datapath  must  have  a  feedback  mechanism  to  allow  the  data  to  cycle  through  the  proper  number 
of  times.  This  is  called  iterative  reuse  and  is  represented  by  adding  the  letters  “ir”  to  the  product 
term:  An.  By  nesting  both  kinds  of  product  terms,  the  formula  specifies  a  number  of  cascaded 

blocks  to  be  reused  a  number  of  times  (Figure  3.2(c)).  In  general,  \\m  An  can  be  restructured  into 
ir;rl/diUd  An),  resulting  in  d  cascaded  instances  of  An,  iterated  over  m/d  times  (where  m/d  is  an 
integer).  The  term  depth  is  used  to  indicate  the  number  of  stages  built  (here,  d). 

When  an  iterative  reuse  datapath  is  built,  it  is  important  that  the  reused  portion  of  the  datapath 
buffer  the  entire  vector,  so  the  “head"  of  the  data  vector  does  not  feed  back  too  soon  and  collide  with 
its  own  “tail.”  This  is  equivalent  to  requiring  that  the  latency  (in  cycles)  be  at  least  1  / (its  throughput 
in  transforms  per  cycle) .  If  the  datapath  does  not  naturally  have  this  property,  it  is  necessary  to  add 
buffers  to  increase  its  latency. 
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3.2.1  Iterative  Reuse  of  Other  Formula  Constructs 


Dependence  on  iteration  variable.  Iterative  reuse  can  also  be  applied  to  the  iterative  product 


771—1 


n-4?. 


where  the  nxn  matrix  A  is  parameterized  by  iteration  variable  l.  If  this  formula  is  iteratively  reused 

(£) 

with  depth  d  =  1,  then  one  computational  structure  capable  of  performing  all  of  the  variants  of  An 
(where  0  <  t  <  m)  is  constructed.  In  the  worst  case,  m  different  independent  blocks  must  be 
built,  but  often,  the  structure  of  the  algorithm  allows  these  blocks  to  be  simplified  by  sharing  logic 
or  arithmetic  units  (for  example,  the  basic  block  in  algorithm  (2.21)).  This  situation  is  analogous  to 
streaming  reuse  of  the  indexed  tensor  product  in  Section  3.1.1. 

Diagonal  matrices.  Diagonal  matrices  are  often  used  in  product  terms  with  a  dependence  on 
the  iteration  variable: 

m— 1 

n  d" 

£=0 

This  formula  can  be  iteratively  reused  by  storing  all  mn  constants,  and  adding  simple  logic  to 
choose  from  the  correct  n  depending  on  the  value  of  l.  This  also  works  in  the  streaming  case,  where 
it  can  be  written  as  StreamDiag(D",  w). 

An  example  of  this  type  of  formula  construct  can  be  found  in  the  Pease  FFT  (2.13). 
Permutation  matrices.  Similar  to  diagonal  matrices,  iterative  reuse  can  also  be  applied  to  per¬ 
mutations.  If  streaming  reuse  is  not  used,  then  this  simply  is  an  instance  of  the  generic  dependence 
on  iteration  variable  discussed  above.  When  both  streaming  and  iterative  reuse  are  used,  the  formula 
becomes 

771—1 

J  StreamDiag(P”,  w). 

1=0 

Chapter  5  describes  how  this  permutation  is  implemented. 


3.3  Formula-Based  Hardware  Model 

Table  3.1  describes  rules  for  deriving  the  latency,  throughput,  and  an  approximate  area  cost  of  four 
basic  formula  constructs.  Given  a  matrix  formula  F,  the  table  provides  formulas  for  latency  L(F) 
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Formula  F 

Latency  L(F) 

Throughput  T(F) 

Area  cost  C(F) 

F  —  ,4®  •  A ^ 

J-  n  —  -tin  Jin 

e  m^)) 

minfTfA^)) 

E  ,(G(A«)) 

Fn.  =  III  An 

max(T(^);fc  •  L(An)) 

mm(  HT1’ k-LtAn)) 

C(An)  +  G(mux) 

Fmn  =  Im  &  -A-n 

L(An) 

T(An) 

m  ■  C(An) 

Fmn  =  Im  An 

L(An) 

T(An)/m 

C(An) 

Table  3.1:  Given  a  matrix  formula  F,  formulas  for  latency  L(F)  (cycles),  throughput  T(F)  (trans¬ 
forms  per  cycle)  and  approximate  area  cost  C(F)  (relative  to  the  area  cost  of  sub-modules). 

(cycles),  throughput  T(F)  (transforms  per  cycle)  and  approximate  area  cost  C(F)  relative  to  the 
latency,  throughput,  and  area  of  F’ s  submodules.  The  entries  of  this  table  can  be  recursively  used  to 
reason  about  complex  formulas.  Section  3.4  will  give  an  example  for  a  common  formula  type  that 
can  combine  streaming  reuse  and  iterative  reuse. 

3.4  Combining  Streaming  and  Iterative  Reuse 

Often,  transform  algorithms  contain  the  form  Wk{Im  <8>  An).  This  structure  can  utilize  both  iterative 
reuse  (due  to  the  H)  and  streaming  reuse  (due  to  Im  Q  An),  allowing  a  wide  range  of  hybrid 
implementations  that  exhibit  flexibility  across  two  dimensions.  One  can  restructure  this  formula  to 
have  streaming  and  iterative  reuse  of  parameterized  amounts: 

k/d—l  /  d- 1  \ 

II  II  ^nm/w  ®Sr  Uw/n  ®  An))  , 

e0=o  \l=o  / 

where  d  is  the  depth  of  the  cascaded  stages  (ranging  from  1  to  k;  k/d  must  be  an  integer).  Parameter 
w  is  the  streaming  width,  a  multiple  of  n. 

This  parameterized  datapath  is  illustrated  in  Figure  3.3.  Each  stage  consists  of  w/n  parallel 
instances  of  Ari ;  d  stages  are  built  in  series.  Let  Bmn  represent  this  array  of  dw/n  many  An  blocks. 
Data  are  loaded  into  the  cascaded  stages  at  a  rate  of  w  per  cycle  over  mn/w  cycles.  The  vector 
feeds  back  and  passes  through  the  series  of  stages  a  total  of  k/d  times. 

Latency  and  throughput.  Given  this  combined  reuse  example,  one  can  use  the  structure  of  the 
formula  to  analyze  the  effect  of  parameters  d  and  w  on  the  datapath.  The  following  are  calculations 
that  correspond  to  evaluating  the  general  rules  from  Table  3.1  for  the  specific  parameters  of  this 
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w  words 
per  cycle 


n 


n 


{! 

{| 


mn/w  cycles 


d  stages, 
reused  k/d  times 


Figure  3.3:  Combining  iterative  and  streaming  reuse:  II k/d  {T\d^nm/w  (Au/n  ®  ^n)))  • 


combined  reuse  example  (Figure  3.3).  These  calculations  assume  that  Bmn  (the  collective  block  of 
An  blocks)  is  fully  pipelined,  i.e.,  its  throughput  is  dictated  by  the  problem  size  and  streaming  width 
only:  T(Bmn )  =  w/mn.  The  analysis  of  latency  and  throughput  for  this  combined  reuse  example 
includes  the  following  two  cases: 


•  Case  1:  Iterative  reuse.  This  case  occurs  when  d  <  k,  meaning  the  data  will  iterate  over 
the  internal  block  at  least  2  times.  As  discussed  in  Section  3.2,  the  internal  block’s  minimum 
latency  is  determined  by  its  throughput.  So,  if  d-L(An)  <  mn/w ,  buffers  are  added  until  they 
are  equal.  Thus,  internal  block  Bmn  has  latency  L(Bmn )  =  ma x(mn/w,  d  ■  L(An)).  The 
latency  of  the  whole  system  is  k/d  times  this,  giving  latency  =  max(mnk/dw,  k  ■  L{An)). 
Because  this  datapath  utilizes  iterative  reuse,  a  new  vector  cannot  enter  until  the  previous 
vector  begins  exiting  the  datapath,  so  the  throughput  (in  transforms  per  cycle)  is  the  inverse 
of  the  latency,  min (dw/mnk,  1  j(k  ■  L(An))). 

•  Case  2:  No  iterative  reuse.  This  case  occurs  when  d  =  k.  Now,  no  iterative  reuse  is 
performed;  the  data  only  passes  through  the  inner  block  once.  The  datapath  consists  of  d  =  k 
stages,  giving  latency  =  k  ■  L(An).  Because  the  data  never  feeds  back,  the  throughput  is 
limited  only  by  the  streaming  width,  giving  throughput  =  w/mn  transforms  per  cycle. 

These  equations  show  that  increasing  w  and  d  will  lead  to  lower  latency  and  higher  throughput 
in  equal  weights,  until  either  the  data  flows  so  quickly  that  the  latency  of  the  computation  dominates 
(d  ■  L(An)  >  mn/w),  or  d  increases  until  no  iterative  reuse  is  performed  (d  =  k). 
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Flexibility.  Additionally,  there  is  one  important  distinction  that  must  be  made  between  pa¬ 
rameters  d  and  w:  as  w  grows,  the  datapath  requires  greater  bandwidth  at  its  ports,  and  the  cost 
of  interconnect  and  multiplexers  increases.  In  the  simple  example  considered  here,  a  design  with 
w  =  2,  d  =  16  will  have  roughly  the  same  cost  and  performance  as  a  design  with  w  =  16.  d  =  2, 
but  the  latter  will  require  a  bandwidth  of  16  words  during  loading  and  unloading  phases,  while  the 
former  only  requires  two  words  per  cycle  (over  8x  more  cycles).  For  this  reason,  it  is  preferable  to 
increase  d  instead  of  w.  However,  d  must  divide  k  evenly  (k  is  typically  the  log2  of  the  transform 
size).  In  many  cases,  this  becomes  an  “all  or  nothing”  situation,  where  the  only  options  are  d  =  1 
and  d  =  k.  In  those  cases,  the  added  flexibility  provided  by  w  is  important. 

Lastly,  when  the  datapath  does  not  employ  iterative  reuse  (i.e.,  when  d  =  k),  the  designer 
typically  has  a  wider  choice  of  algorithms  because  the  internal  stages  are  not  required  to  be  uniform. 

Datapath  efficiency  and  vector  interleaving.  Assume  we  have  an  iterative  reuse  datapath  that 
reuses  block  Bn.  Here,  Bn  can  represent  any  datapath,  including  those  with  further  iterative  reuse 
internally.  Bn  has  an  inherent  latency  L(Bn )  and  throughput  T(Bn)  (determined  by  the  inverse  of 
the  minimum  initiation  interval  of  input  vectors). 

With  a  single  vector  recirculating  through  Bn,  the  effective  throughput  of  Bn  may  be  further 
limited  to  1  / L(Bn)  if  L ( Bn )  is  greater  than  the  minimum  initiation  interval.  In  this  case  the  head 
of  the  vector  is  still  inside  Bn  when  Bn’s  input  is  ready  to  accept  a  new  iteration. 

Let  R  be  a  utilization  ratio  of  the  effective  throughput  to  the  inherent  throughput  of  Bn ;  this 
quantifies  the  portion  of  Bn’ s  potential  throughput  that  is  utilized  in  the  system.  For  a  single  vector, 
R  =  (1  /L(Bn))/T(Bn). 

When  the  utilization  by  a  single  vector  is  sufficiently  low,  the  system  can  interleave  multiple 
vectors  to  make  use  of  the  full  throughput  capacity  of  Bn.  Formally,  if  R  <  l/V  (where  V  is  an 
integer),  the  system  may  interleave  V  computations  through  the  datapath,  increasing  the  effective 
throughput  and  thus  increasing  the  utilization  ratio  to  R'  =  ( V/  L  ( Bn  )  )/T(  Bn ) . 

In  some  cases,  a  designer  may  want  to  increase  L(Bn)  artificially  for  better  efficiency.  For 
example,  if  R  =  0.55,  the  designer  could  insert  delay  buffers  in  the  datapath  (increase  L(Bn )) 
until  R  is  reduced  to  0.5  and  then  interleave  two  vectors.  This  increased  utilization  yields  higher 
throughput  at  the  expense  of  added  latency,  so  the  designer’s  particular  application  requirements 
will  determine  the  suitability  of  this  approach. 
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3.5  Summary 

This  chapter  presented  a  formula-driven  hardware  paradigm  that  connects  a  set  of  sequential  hard¬ 
ware  structures  with  a  set  of  formula  constructs.  These  sequential  reuse  techniques  allow  multiple 
computations  from  an  algorithm  to  be  mapped  to  a  single  processing  element  in  the  datapath. 

First,  streaming  reuse  controls  the  width  of  the  data  vector  and  thus  the  parallelism  of  the  sys¬ 
tem.  Using  these  techniques,  streaming  implementations  require  that  the  computation  be  decom¬ 
posed  into  parallel  sub-blocks.  The  formula  Im  <g>  An  mathematically  captures  this  requirement  for 
streaming  width  w,  where  w  is  an  integer  multiple  of  n  and  w  <  mn.  This  idea  is  extended  when  A 
is  allowed  to  vary  with  an  index  variable:  Im  ®£  A™.  Now,  since  different  A  matrices  are  allowed, 
any  mn  x  mn  matrix  that  can  be  decomposed  into  parallel  n  x  n  subblocks  can  be  streamed  with 
width  w.  The  more  similarity  that  exists  among  the  different  the  A"  blocks,  the  lower  the  implemen¬ 
tation  cost,  because  in  that  case  the  final  hardware  block  can  share  computational  elements  between 
the  different  iterations. 

Similarly,  iterative  reuse  (IR)  controls  the  depth  of  the  datapath,  which  is  the  number  of  cas¬ 
caded  stages  from  input  to  output.  Designs  with  higher  depth  have  more  stages,  while  lower  depths 
correspond  to  fewer  stages  and  thus  more  iterations  of  the  data  vector  over  the  structure.  This  re¬ 
peated  stage  structure  is  written  as  ]”[(i  A™.  Note  that  the  n  x  n  matrix  A  can  depend  on  index  £. 
Thus,  any  product  of  matrices  can  be  iteratively  reused,  but  if  the  A"  matrices  are  not  related,  there 
is  no  benefit  from  using  IR.  For  example,  if  Aq  and  ,4“  are  completely  different  and  cannot  share 
logic,  then  there  would  be  no  benefit  to  using  IR  on  nJ=o  A™,  because  the  IR  version  would  have 
slightly  greater  area  (due  to  the  feedback  path),  the  same  or  worse  latency,  and  roughly  half  of  the 
throughput.  In  this  way,  the  structure  of  a  formula  is  deeply  connected  with  whether  or  not  iterative 
reuse  provides  a  benefit. 

Thus,  the  type  and  amount  of  sequential  reuse  that  is  desired  affects  the  amount  of  regularity 
that  is  beneficial  within  an  algorithm.  For  example,  the  Pease  FFT  algorithm  (2.13)  and  the  Iterative 
FFT  algorithm  (2. 15)  are  similar  except  the  Pease  algorithm  has  the  same  permutation  in  each  of  its 
stages,  while  the  Iterative  algorithm  does  not.  So,  when  the  user  requests  a  design  with  IR,  the  Pease 
algorithm  is  more  efficient  than  the  Iterative  algorithm.  In  this  way,  Spiral  does  not  search  to  select 
algorithms,  but  chooses  based  upon  how  well  an  algorithm’s  structure  fits  within  the  paradigm  the 
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user  has  requested. 


Chapter  4 


Automatic  Compilation  from  Formula  to 
Datapath 


Previous  chapters  presented  a  mathematical  language  for  describing  linear  transform  algorithms 
and  how  that  formula  space  can  be  refined  to  allow  explicit  specification  of  sequential  reuse.  This 
chapter  describes  how  these  concepts  are  used  to  drive  an  automatic  compilation  framework  that 
maps  a  transform  to  an  algorithmic  formula,  transforms  the  formula  to  one  that  includes  the  desired 
streaming  and  iterative  reuse  characteristics,  and  maps  that  hardware  formula  into  a  register-transfer 
level  (RTL)  Verilog  description. 

Figure  4.1  shows  a  high-level  view  of  each  of  the  steps  in  this  compilation  process.  First,  a 
transform  enters  the  system,  an  algorithm  is  selected,  and  a  formula  representation  of  that  algorithm 
is  produced.  Then,  formula  rewriting  is  used  to  apply  iterative  reuse  and  streaming  reuse  to  the 
formula;  the  resulting  “hardware  formula”  now  has  explicit  sequential  reuse.  Lastly,  the  hardware 
formula  is  then  translated  into  a  register-transfer  level  (RTL)  Verilog  description.  Below,  each  of 
these  steps  is  explained. 

4.1  Hardware  Directives 

This  system  uses  hardware  directives  to  include  information  about  the  desired  features  of  the  hard¬ 
ware  implementation  to  be  produced  by  the  compilation  framework.  Flardware  directives  are  tags 
placed  around  a  formula  or  portion  of  a  formula.  This  work  utilizes  two  directives.  First,  the 
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hardware 

directives 


transform 


RTL  Verilog 

Figure  4. 1 :  Block  diagram  of  hardware  compilation  system. 


streaming  tag  indicates  streaming  reuse: 


stream(iu) 


This  indicates  that  the  contents  of  A  should  be  restructured  such  that  the  resulting  hardware  formula 
will  be  implemented  in  a  block  that  contains  w  input  and  output  ports,  with  data  streamed  at  w 
elements  per  cycle,  over  n/w  cycles. 

Next,  the  depth  tag  indicates  whether  to  employ  iterative  reuse  (and  of  what  depth): 


depth(do,di ) 


This  tag  indicates  that  the  contents  of  An  should  be  modified  so  that  the  top-most  n  be  restructured 
to  have  iterative  reuse  with  depth  do,  its  next  nested  ]  [  with  depth  d\,  and  so  on. 

The  tag 


stream(ty),  depth(do,.-- ) 


represents  both  directives  at  once. 


4.2.  Formula  Generation:  From  Transform  to  Algorithm 
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4.2  Formula  Generation:  From  Transform  to  Algorithm 

The  formula  generation  stage  takes  in  a  transform  of  a  fixed  size,  applies  one  or  more  transform 
algorithms,  and  outputs  a  formula  that  specifies  the  complete  algorithm  in  the  formula  language 
defined  in  Section  2.2. 

This  stage  contains  the  system’s  knowledge  of  transform  algorithms,  as  detailed  in  Section  2.3. 
For  each  algorithm,  it  consists  of  a  set  of  conditions  under  which  it  is  applicable  and  the  parame¬ 
terized  formula  to  output.  Choices  made  at  this  stage  are  guided  by  the  aforementioned  hardware 
directives,  which  can  dictate  which  algorithms  are  appropriate  depending  on  the  parameters  chosen. 
Other  algorithmic  guidance  such  as  desired  radix  can  also  be  specified  by  the  user  at  this  time. 

Chapter  6  discusses  transform  algorithms  in  the  context  of  streaming  and  iterative  reuse,  and 
explains  which  algorithms  are  used  in  which  reuse  scenarios.  This  portion  of  the  compilation  frame¬ 
work  is  implemented  inside  of  Spiral  [4]. 

Example.  For  example,  if  the  desired  transform  is  a  128  point  DFT  with  radix  2,  width  4  and 
depth  1, 


DFT128  , 


stream(4),  depth(l) 


the  Pease  FFT  algorithm  (2.13)  will  be  utilized,  producing 


R128 


stream(4),  depth(l) 


Or,  the  user  could  specify  depth  7,  producing 


stream(4),  depth(7) 


Li28  n  (/e4  0  dft2)  (> 0  (Et 


t~  i 


stream(4),  depth(7) 


which  utilizes  the  Iterative  FFT  algorithm  (2.15)  with  radix  2.  Lastly,  the  user  could  specify  radix 
4  and  depth  3  and  the  system  would  first  use  the  Mixed- Radix  algorithm  (2.16)  to  decompose  the 
DFT  128  into  DFTg4  and  DFT2.  Then  it  would  use  the  Iterative  algorithm  (2.15)  to  implement 
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DFT64,  producing 

DFT128  ->■  Llf(I2  ®  DFT64)Li28r2128(I64  0  DFT2)L*f , 

stream(4),  depth(3)  stream(4),  depth(3) 

where  DFT64  is  decomposed: 


DFT64  -  Lf  (  n(A6  ®  DFT4)  (i#  ®  ( Ef~e  ■  Qf “*))  j  Rf, 
v=o  J 

and  DFT4  is  further  decomposed  using  (2.12). 

4.3  Formula  Rewriting:  Algorithm  to  Hardware  Formula 

Next,  the  algorithmic  formula  enters  the  formula  rewriting  stage,  which  takes  the  formula  plus 
hardware  directives  and  produces  a  hardware  formula,  which  is  a  restructured  and  annotated  version 
of  the  algorithm  that  includes  specification  of  sequential  reuse  as  discussed  in  Chapter  3.  Thus,  the 
output  of  this  stage  corresponds  directly  to  sequential  hardware  implementation. 

The  compilation  framework  accomplishes  this  restructuring  using  a  rewriting  system  that  takes 
in  a  tagged  formula,  and  using  a  set  of  rewriting  rules,  attempts  to  restructure  the  formula  or  propa¬ 
gate  the  tag  downward.  The  end  result  of  this  stage  is  a  formula  with  no  remaining  tags. 

In  addition  to  implementing  streaming  and  iterative  reuse,  rewriting  rules  within  this  section 
of  the  compiler  perform  simplifications  and  optimizations  to  improve  the  quality  of  the  generated 
design. 

4.3.1  Rewriting  for  Streaming  Reuse 

First,  Table  4.1  lists  the  rewriting  rules  that  the  system  utilizes  for  streaming  reuse.  Each  rule  takes 
a  formula  construct  with  the  stream  tag,  and  either  rewrites  the  formula  to  implement  streaming  or 
pushes  the  tag  inward  to  be  processed  at  the  next  level.  Each  of  the  rules  has  a  simple  explanation: 

•  base-SR:  If  the  size  of  a  matrix  is  the  same  as  the  desired  streaming  width,  the  stream  tag  is 


not  necessary  and  can  be  dropped. 
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name 

rule  condition 

base-SR 

^4?^  >  An 

stream  (n) 

product-SR 

A-n  *  Hn  *  *  ‘  Zn,  * 

stream(it;)  stream(it;)  stream(it;)  stream(it;) 

stream-IR 

v  ^  v  stream(it;) 

stream(it;) 

stream  1 

Im  0  Ak  -»•  ®sr  {jw/k  ®  Ak)  mk  >  w  and  k  <  w 

stream(u;) 

streaml-dep 

Im  A*  ->  /mfe/w  <8>f0  (lw/k  Akt 0.w/k+h)  mk  >  w  and  k  <  w 

stream(w) 

stream2 

Im  &  Aj%  *  Im  &  -^4/c  k  >  W 

stream(it;)  stream  (in) 

stream2-dep 

/m  ->  Im  ^  m 

stream(ui)  stream(iu) 

stream-diag 

Dn  — >  StreamDiag(.Dn,  ru)  u;  n 

stream(iu) 

stream-perm 

Pn  — ►  StreamPerm(Pra,  m)  u>  n 

stream(it;) 

Table  4.1:  Rewriting  rules  for  streaming  reuse. 
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•  product-SR:  If  a  product  of  matrices  is  tagged  for  streaming,  the  stream  tag  is  propagated  to 
each  individual  matrix. 

•  stream-IR:  Similarly,  a  stream  tag  is  propagated  into  the  inner  term  of  an  iterative  reuse 
product. 

•  streaml  and  streaml-dep:  If  the  size  of  A  is  less  than  or  equal  to  the  size  of  the  stream, 
the  rule  attempts  to  unroll  the  inner  tensor  product  to  match  the  stream  size.  If  A  depends  on 
index  variable  £,  the  same  rewriting  is  performed  except  the  dependence  must  now  change  to 
include  two  index  variables  £q  and  £\- 

•  stream2  and  stream2-dep:  If  the  size  of  A  is  larger  than  the  size  of  the  stream,  the  tag  is 
propagated  inward,  and  another  rule  must  restructure  A  to  the  right  streaming  width. 

•  stream-diag:  A  diagonal  to  be  streamed  is  rewritten  into  the  streaming  diagonal  form  (3.1) 
as  described  in  Chapter  3.2.1. 

•  stream-perm:  A  permutation  to  be  streamed  is  placed  in  a  streaming  permutation  wrapper, 
and  will  be  implemented  in  hardware  as  discussed  in  Chapter  5. 

4.3.2  Rewriting  for  Iterative  Reuse 

Next,  Table  4.2  shows  the  rules  used  for  iterative  reuse: 

•  base-IR:  If  the  formula  tagged  with  a  depth  tag  does  not  contain  any  product  terms,  the  depth 
tag  is  unneeded. 

•  drop-tag-IR:  When  the  depth  tag  does  not  have  any  terms  left  in  its  list,  it  is  dropped  because 
it  no  longer  holds  any  directives. 

•  product-IR  and  product-IR-dep:  When  a  product  term  with  a  depth  tag  is  encountered, 
iterative  reuse  is  applied,  with  an  inner  product  of  appropriate  depth  (here,  do)-  The  first  term 
(do)  is  then  removed  from  the  tag,  and  the  tag  is  propagated  inward.  Similarly,  if  the  inner 
formula  of  a  product  term  has  a  dependence  on  the  iteration  variable,  the  same  process  occurs, 
except  the  iteration  dependence  must  combine  the  two  product  variables  £  and  k. 
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name 


base-IR 


rule 


depth(do,... ) 


Ar 


condition 


A /,.  contains  no  Y 


drop-tag-IR 


product-IR 


product-IR-dep 


depth() 


m—  1 

II  An 

i=Q 

depth(do,^i,.” ) 


(m/do)- 1 

n  1 


£=0 


u 

n 

^fc=0depth(<ii,...) 


depth  tag  empty 


m/  do  integer 


m— 1 

n-4 ? 

^1=0 

depth(c2o^i  V ) 


(m/ d0)  — 1 

n  * 


t=o 


/*  \ 

II  ^e-do+k 

\  °  depth(fii .... )/ 


m/  do  integer 


(n 


A  ■ A  •••  4 


product-noIR  (II  Ar 
product-noIR-dep  (  J^[  .4"  )  — >•  Aft  ■  A"  ■  ■  ■  A”  _ , 


,1=0 


Table  4.2:  Rewriting  rules  for  iterative  reuse. 


43 


Chapter  4.  Automatic  Compilation  from  Formula  to  Datapath 


•  product-noIR  and  product-noIR-dep:  When  a  product  term  is  encountered  without  a  depth 
tag,  the  product  is  unrolled.  Similarly,  if  the  product  term  has  no  depth  tag  but  the  inner 
formula  depends  on  i,  the  £  variable  is  replaced  with  constants  0,1,...  as  the  product  term  is 
unrolled. 

Other  specialized  rules  are  not  needed  for  the  case  where  streaming  permutations  or  streaming 
diagonals  are  iteratively  reused  and  dependent  on  the  iteration  index.  In  the  case  of  permutations, 
the  formula  can  be  directly  implemented  as  described  in  Chapter  3.2.1  and  Chapter  5. 

Like  the  formula  generation  stage,  this  portion  of  the  compilation  framework  is  also  imple¬ 
mented  inside  of  Spiral. 

4.4  RTL  Generation:  Hardware  Formula  to  RTL  Verilog 

After  rewriting,  the  resulting  hardware  formula  is  expressed  in  Hardware  SPL,  an  extended  version 
of  the  language  presented  in  Section  2.2.  In  addition  to  the  terms  allowed  in  the  original  language, 
this  hardware  language  allows  tensor  products  to  have  explicit  streaming  reuse,  and  allows  product 
terms  to  have  iterative  reuse.  In  addition  to  diagonal  matrices  Dn  and  permutation  matrices  Pn,  the 
language  also  allows  their  streaming  variants  StreamDiag(Dri,  w)  and  S t re ai n P e r m  (  Pn .  w ) . 

The  RTL  generation  stage  takes  in  such  a  hardware  formula  and  produces  a  synthesizable 
register-transfer  level  (RTL)  Verilog  description  of  the  corresponding  datapath.  This  portion  of 
the  compilation  framework  is  partially  implemented  inside  of  Spiral  and  partially  implemented  as  a 
standalone  backend  that  runs  alongside  of  Spiral. 

4.4.1  Compilation  Overview 

There  are  several  classes  of  formula  constructs  to  consider  in  the  RTL  generation  stage.  This  section 
briefly  discusses  each  and  indicates  which  language  elements  of  the  hardware  SPL  language  are  in 
each  class. 

ip 

Combinational  Formula.  Any  portion  of  the  formula  without  iterative  reuse  (H  ),  streaming 
reuse  (10s1'  A)  or  explicit  sequential  meaning  (streaming  permutations)  can  automatically  be  mapped 
into  a  combinational  datapath,  as  discussed  in  Section  2.2.  So,  this  class  of  formulas  contains: 
computational  basic  blocks  ( An ),  unstreamed  diagonals  and  permutations  ( I)n  and  Pn),  matrix 
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products  without  iterative  reuse  (An  ■  Bn  . . . ),  and  tensor  products  without  streaming  reuse  (Im  0 
An).  Additionally,  note  that  the  inner  term  of  a  streaming  diagonal  StreamDiag(Dn,  w)  is  simply 
a  computational  basic  block  (see  Chapter  3.1.1),  so  its  inner  term  fits  within  this  set  as  well  (while 
its  outer  term  will  fall  under  streaming  reuse,  below). 

When  the  compiler  encounters  this  type  of  formula,  it  constructs  a  hardware  datapath  and  auto¬ 
matically  pipelines  the  path  by  inserting  staging  registers.  Additional  buffers  are  added  if  needed  to 
guarantee  that  corresponding  data  words  reach  the  output  ports  together  in  the  same  cycle. 

A  variation  on  this  class  of  formula  is  a  basic  block  with  dependence  on  an  index  variable,  which 
is  set  by  a  (/  0^  . . . )  or  \\f  formula.  In  this  case,  the  compiler  determines  the  possible  values  for 
£,  and  constructs  hardware  to  calculate  the  result  for  each.  The  compiler  attempts  to  simplify  the 
datapaths  as  much  as  possible  by  reducing  the  dependency  on  t  to  the  smallest  segments  possible. 

Streaming  Reuse.  A  streaming  reuse  tensor  product  Im  0sr  An  is  implemented  in  hardware 
as  one  An  block,  with  the  input  streamed  at  a  rate  of  n  words  per  cycle  over  m  cycles  (as  seen  in 
Figure  3.1(b)).  When  implementing  the  indexed  tensor  product  Im  A^^ ,  a  hardware  counter  is 
constructed,  and  index  variable  £  is  passed  to  the  subblock. 

Streaming  Diagonal.  A  diagonal  matrix  scales  each  element  in  the  input  vector  by  the  corre¬ 
sponding  value  from  the  diagonal  of  the  matrix.  As  shown  in  Section  3.1.1,  a  streamed  diagonal  is 
represented  as  an  indexed  tensor  product.  The  hardware  implementation  uses  w  multipliers,  where 
w  is  the  streaming  width.  Then,  the  n  values  from  the  diagonal  are  stored  in  w  lookup  tables,  which 
feed  the  multipliers  with  the  appropriate  data  at  each  cycle.  In  some  structured  diagonals  (for  ex¬ 
ample,  one  where  every  other  diagonal  entry  is  1),  the  system  can  automatically  reduce  the  cost  by 
simplifying  away  some  of  the  w  multipliers. 

Streaming  Permutation.  A  permutation  with  a  streaming  width  w  must  perform  an  n  point 
permutation  while  data  streams  in  and  out  of  the  system  at  a  rate  of  w  words  per  cycle.  In  Chap¬ 
ter  5,  this  problem  is  discussed  further  and  two  techniques  [19,  20]  to  construct  such  systems  are 
explained. 

Iterative  Reuse.  As  shown  in  Fig.  3.2(c),  an  iterative  reuse  structure  Of=o  's  built  with  an 
input  multiplexer  and  feedback  loop.  In  addition,  it  is  necessary  to  ensure  that  the  latency  through 
the  iteratively  reused  block  Am  is  sufficient  to  prevent  the  vector’s  “head”  from  colliding  with  its 
own  “tail”  at  the  input  multiplexer  (see  Chapter  3.2).  So,  the  compiler  must  determine  the  latency 
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(in  cycles)  through  Am  and  (if  necessary)  add  additional  buffering. 

In  the  simplest  case,  Am  does  not  depend  on  the  iteration  index  £,  and  the  only  additional 
hardware  required  are  the  feedback  path,  multiplexer,  and  multiplexing  control  logic.  However,  the 
compilation  framework  allows  some  matrix  formulas  to  be  iteratively  reused  with  a  dependence 
on  the  iteration  index.  For  example,  iterative  reuse  can  be  used  on  diagonal  matrices,  with  the 
diagonal’s  values  changing  with  the  iteration.  This  is  expressed  as: 

k-i 

£=0 

In  each  iteration,  the  diagonal  matrix  will  scale  the  n  elements  of  the  input  vector  by  n  constants, 
which  vary  with  £ .  So,  when  this  matrix  is  iteratively  reused,  the  datapath  will  still  contain  the  logic 
needed  to  scale  the  data  vector,  but  it  must  now  also  contain  lookup  tables  that  hold  all  i-n  constant 
values. 

Other  elements  that  may  be  depend  on  the  iteration  in  an  iterative  reuse  structure  in  this  manner 
include  streaming  permutations  (described  in  Chapter  5)  and  basic  blocks. 

Other  Functionality  of  the  RTL  Generation  Stage.  In  addition  to  translating  the  hardware 
formula  into  an  RTL  description,  the  RTL  generation  stage  includes  added  functionality.  For  ex¬ 
ample,  latency  and  throughput  (relative  to  the  cycle  time)  are  computed  for  all  blocks,  and  can  be 
reported  at  the  top  level  or  any  level  below.  Other  options  include  basic  cost  reporting  (e.g.,  count¬ 
ing  the  numbers  and  size  of  RAMs,  ROMs,  and  arithmetic  units).  In  previous  work  [21],  an  FPGA 
area  model  was  created  and  calibrated  for  a  subset  of  the  design  space  considered  here.  Such  an 
approach  may  be  able  to  be  extended  to  produce  an  accurate  FPGA  area  model  for  the  full  design 
space.  Lastly,  the  generator  is  also  able  to  include  basic  Verilog  testbenches  for  any  design  it  gener¬ 
ates.  With  one  command,  it  is  possible  to  go  from  transform  to  formula  to  RTL,  and  run  a  Verilog 
simulation. 

4.4.2  Compilation  Stages 

The  RTL  generation  stage  is  implemented  as  a  standalone  tool  that  takes  input  from  Spiral  and 
produces  register-transfer  level  Verilog.  The  following  are  the  stages  undertaken  as  it  translates 
from  a  Hardware  SPL  formula  to  a  hardware  implementation. 
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1.  Basic  block  compilation:  Spiral  takes  all  computational  basic  blocks  (dense  matrices)  and 
converts  each  into  a  directed  acyclic  graph  that  represents  the  computation  of  the  basic  block. 

2.  Backend  input:  The  backend  RTL  generator  reads  the  formula  and  basic  block  code  pro¬ 
duced  by  Spiral,  and  creates  an  intermediate  data  structure  that  contains  objects  for  each  class 

ip 

of  formula  element  (e.g.  ,  StreamPerm,  /< g)sr,  etc.). 

3.  Pipelining:  The  compiler  next  adds  pipelining  stages  to  all  basic  blocks  and  checks  that  the 
block’s  output  elements  are  synchronized  correctly.  If  they  are  not,  the  system  adds  buffers 
where  necessary. 

4.  Module  sizes:  This  stage  determines  the  input/output  sizes  (number  of  data  points)  for  each 
module.  Then,  it  checks  these  values  against  the  sizes  inferred  from  the  original  formula  for 
consistency. 

5.  Iteration  variable:  This  compilation  stage  resolves  all  iteration  variables  (e.g.,  indices  from 
\\f  ,4  or  I  (/)(,  A),  analyzes  them,  and  propagates  them  to  their  appropriate  submodules. 

6.  Variable  ranges:  This  stage  attempts  to  determine  the  minimum  and  maximum  values  of 
each  variable  in  the  design’s  basic  blocks.  Often,  variables  that  are  computed  from  iteration 
counters  or  computed  values  from  lookup  tables  can  be  bound.  In  particular,  this  information 
will  become  useful  in  simplifying  conditional  statements  and  pruning  unnecessary  computa¬ 
tion. 

7.  Data  type:  Next,  the  compiler  determines  the  data  type  of  all  variables,  including  the  number 
of  bits  of  precision  for  fixed  point  types.  This  information  can  be  inferred  based  upon  the  input 
data  type  and  the  computations  that  are  performed.  Furthermore,  the  compiler  implements 
scaling  at  this  stage,  if  the  user  has  requested  a  scaled  fixed  point  implementation. 

8.  Lookup  tables:  This  stage  analyzes  the  constant  lookup  tables  that  are  used  when  imple¬ 
menting  streaming  diagonal  matrices.  The  compiler  examines  the  tables,  removing  trivial, 
unused,  or  redundant  ones,  and  updating  references  to  them  appropriately. 

9.  Cleanup:  Next,  the  compiler  performs  a  simplification  and  cleanup  step.  This  stage  merges 
identical  arithmetic  operations,  removes  trivial  ones,  and  performs  other  simplifications.  For 
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example,  code  implementing  the  following  equations: 


to  —  o-o  +  bo 
ti  =  a0  +  bo 
t2  =  a±  x  1 
f3  =  a2  x  0 
tA  =  t3  +  fi 

would  be  simplified  to: 

to  =  «0  +  &0 
t2  =  at 
tA  =  to- 

10.  Latency  and  buffering:  This  stage  uses  formulas  like  those  seen  in  Section  3.3  to  determine 
the  latency  of  each  module  in  the  design.  As  explained  in  Section  3.2,  modules  that  are  it¬ 
eratively  reused  have  a  minimum  latency  requirement  to  keep  the  head  of  the  vector  from 
colliding  with  its  tail.  Once  latency  is  determined,  the  compiler  checks  to  see  if  this  require¬ 
ment  is  met;  if  it  is  not,  the  compiler  adds  buffers  to  the  datapath.  (If  a  streaming  permutation 
is  used  within  the  IR  block,  some  or  all  of  the  buffering  can  be  inserted  into  the  streaming 
permutation  at  no  added  cost.) 

11.  Throughput:  Next,  this  stage  finds  the  throughput  of  each  module,  in  order  to  determine  the 
overall  throughput  of  the  system,  which  becomes  part  of  the  input/output  specification  for  the 
end  user.  The  compiler  calculates  throughput  using  formulas  like  those  seen  in  Section  3.3. 

12.  BRAM  allocation  (optional):  If  the  user  is  targeting  an  FPGA,  he  or  she  may  request  that 
block  RAMs  be  used  and  present  the  compiler  with  a  budget.  In  this  situation,  the  com¬ 
piler  will  examine  all  of  the  memory  structures  needed  within  the  design  and  choose  the  best 
allocation  of  the  allowed  number  of  BRAMs. 
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13.  Resource  estimation  (optional):  If  requested  by  the  user,  the  compiler  can  perform  coarse¬ 
grained  resource  estimation,  determining  the  number,  data  type,  and  precision  of  adders,  mul¬ 
tipliers,  RAMs  and  ROMs. 

14.  Testbench  generation  (optional):  Based  on  the  design’s  data  type,  latency,  and  throughput, 
the  compiler  can  generate  a  Verilog  testbench  suitable  for  simulation.  The  testbench  outputs 
data  in  a  format  that  can  be  read  by  Spiral  in  order  to  verify  the  correctness  and  quantify  the 
error. 

15.  Output:  Lastly,  the  compiler  outputs  a  synthesizable  register  transfer  level  Verilog  imple¬ 
mentation. 


Chapter  5 


Permuting  Streaming  Data 


Permutations  are  crucially  important  portions  of  linear  transform  algorithms,  but  they  are  difficult 
to  perform  in  hardware  structures  that  operate  on  streaming  data,  such  as  those  described  in  Sec¬ 
tion  3.1.  Prior  work  addresses  this  problem  in  a  limited  way  for  a  subset  of  permutations  and  a 
subset  of  problem  sizes.  This  chapter  explores  the  problem  of  streaming  permutations,  explains  the 
prior  work,  and  introduces  a  new  general  technique  that  can  implement  any  fixed  permutation  of 
any  size. 


5.1  Problem 

A  permutation  is  a  fixed  reordering  of  a  given  number  of  data  elements.  Pn  represents  a  permu¬ 
tation  of  an  n  point  data  vector.  Figure  5.1(a)  shows  an  example  of  a  12  point  permutation  P\ 2 . 
Data  elements  (0, . . . ,  11)  enter  concurrently  from  the  left,  are  reordered,  and  exit  in  permuted  or¬ 
der.  A  hardware  implementation  of  such  a  permutation  is  trivial  if  all  n  data  points  are  available 
concurrently:  it  is  simply  built  as  a  reordering  of  wires. 

However,  when  a  permutation  must  be  performed  on  streaming  data,  the  implementation  is 
more  difficult.  Figures  5.1(b)  and  (c)  illustrate  a  streaming  version  of  the  permutation  seen  in 
Figure  5.1(a).  Here,  n  =  12  data  words,  and  the  streaming  width  is  w  =  3  data  words  per  cycle. 
In  Figure  5.1(b)  the  data  labeled  (0, 1,  2)  enter  the  system  on  the  first  cycle,  (3,4,  5)  enter  on  the 
second,  and  so  on,  for  a  total  of  n/w  =  4  cycles.  Figure  5.1(c)  shows  the  structure’s  output,  where 
the  12  words  flow  out  of  the  system  in  their  new  permuted  order  (5,  2, 3, 0, . . . ),  again  with  w  =  3 
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(a)  Permutation 


(b)  Streaming  input 


(c)  Permuted  streaming  output 


Figure  5.1:  Examples:  permutation  and  streaming  permutation. 


words  per  cycle. 

Performing  a  permutation  on  streaming  data  is  difficult  because  data  must  be  reordered  across 
time  boundaries  by  storing  and  retrieving  from  a  memory.  For  example,  the  element  labeled  1  in 
Figure  5.1(b)  streams  into  the  system  during  the  first  cycle  of  the  input  stream,  but  must  be  buffered 
until  the  third  cycle  of  the  output  stream  (in  Figure  5.1(c)). 

A  method  for  implementing  streaming  permutations  must  be  able  to  scale  as  w  (the  number  of 
data  words  per  cycle)  increases.  Thus,  using  one  large  (/’-ported  memory  is  infeasible;  instead,  a 
solution  must  use  multiple  memories  and  partition  the  problem  across  them. 

5.2  Existing  Approaches 

Stride  Permutations.  Most  of  the  existing  work  in  the  literature  on  streaming  permutations  focuses 
on  the  stride  permutation  L.  For  example,  [22]  generates  hardware  implementations  of  Lp  with 
streaming  width  2k  using  a  number  of  small  FIFO  memories  with  interconnection  networks.  Al¬ 
though  this  technique  uses  the  minimal  storages  for  a  given  permutation,  its  usage  of  multiple  small 
independent  memories  leads  to  high  overhead  and  complex  control.  In  another  example,  [23]  is 
able  to  perform  Lp  with  streaming  width  2k  where  k  >  n  —  s  using  memories  and  interconnection 
networks.  Other  similar  techniques  include  [24]. 
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Figure  5.2:  Block  diagram  of  streaming  permutation  structure  (n  points,  w  words  per  cycle)  built 
using  the  bit  matrix  method. 

Data  Format  Converters.  Parhi  [25]  introduces  a  technique  to  design  a  register-based  data 
format  converter  that  can  be  used  to  implement  a  streaming  permutation.  This  method  uses  a  register 
allocation  algorithm  to  design  a  datapath  consisting  of  individually-controlled  registers  connected 
with  wires  and  switches.  This  technique  is  general;  it  can  be  used  on  any  streaming  permutation. 
However,  its  need  to  provide  control  logic  to  all  of  the  individual  registers  and  connection  networks 
leads  to  considerable  overhead. 

Bit  Matrix  Method.  In  [20],  Puschel,  Milder  and  Hoe  present  a  method  for  generating  stream¬ 
ing  permutation  circuits  for  a  subset  of  all  permutations  and  streaming  widths.  Figure  5.2  shows 
a  high-level  view  of  the  designs  produced  using  this  technique,  which  are  comprised  of  two  con¬ 
nection  networks  and  an  array  of  RAMs.  All  storage  is  consolidated  within  w  parallel  RAMs,  each 
which  must  hold  2 n/w  data  words.  This  method  does  not  require  pre-computed  control  values — 
instead,  all  control  is  computed  online  with  inexpensive  bit  operations.  Additionally,  the  connection 
networks  used  by  this  technique  are  optimized  for  the  given  permutation. 

The  method  in  [20]  works  by  representing  permutations  as  bit  matrices — linear  mappings  on 
the  bit-level  representation  of  data  input  and  output  locations.  The  permutation's  bit  matrix  is  then 
decomposed  in  a  way  that  fully  specifies  the  datapath  and  control  logic  for  a  given  permutation. 
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Further,  the  method  optimizes  the  required  connection  networks  and  control  logic;  provable  lower 
bounds  are  reached  for  an  important  subset  of  problems. 

However,  this  technique  does  not  apply  to  all  permutations:  only  a  subset  can  be  represented  as 
bit  matrices,  and  the  technique  requires  the  problem  size  n  and  streaming  width  w  to  be  powers  of 
two. 


5.3  General  Streaming  Permutations 

This  section  presents  a  parameterized  datapath  and  accompanying  mapping  technique  that  is  able 
to  be  configured  to  perform  any  fixed  streaming  permutation.  Similar  to  [20],  this  technique  uses 
parallel  memories  and  switching  networks;  it  is  scalable  with  the  problem  size  and  the  streaming 
width.  Unlike  [20],  this  technique  applies  to  all  streaming  permutations  and  does  not  force  the 
problem  size  or  streaming  width  to  be  a  power  of  two.  This  work  was  first  presented  in  [19]. 

5.3.1  Parameterized  Datapath 

A  key  aspect  of  this  work  is  the  identification  of  a  scalable  datapath  structure  that  can  be  configured 
to  perform  any  fixed  streaming  permutation  with  any  streaming  width.  This  section  presents  this 
parameterized  datapath  and  discusses  its  operation. 

Figure  5.3  shows  the  datapath  considered.  It  consists  of  two  memory  arrays  Mo  and  Mi,  a 
connection  network  N,  and  lookup  tables  (ROMs)  that  contain  address  and  control  data  (II.  T,  and 
W). 

Each  memory  array  contains  w  parallel  memories,  each  with  capacity  2 n/w.  Each  has  one  read 
port  and  one  write  port.1  Memory  arrays  Mo  and  M\  are  connected  to  lookup  tables  R  and  W 
(respectively),  which  hold  pre-computed  address  values. 

N  represents  a  connection  network  that  is  capable  of  taking  w  data  points  as  input  and  outputting 
them  in  any  given  order,  with  ROM  T  storing  its  pre-computed  control  data.  When  w  is  a  power  of 
two,  this  network  can  be  built  as  explained  in  [26,  27].  The  resulting  network  utilizes  w  log2(m)  — 
w  +  1  2-by-2  switches,  which  is  asymptotically  optimal.  If  w  is  not  a  power  of  two,  N  could 

1  It  is  also  possible  to  replace  the  2 n/w  word  dual  ported  memory  with  two  single  ported  memories  of  size  n/w. 
Then,  data  is  written  into  one  memory  while  it  is  read  out  of  the  other.  This  optimization  may  be  beneficial  on  an  ASIC, 
where  the  designer  can  build  precisely  the  necessary  memory  structures.  However,  this  transformation  is  not  beneficial 
on  most  current  FPGAs,  which  typically  have  embedded  dual-ported  memories. 
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Figure  5.3:  Block  diagram  of  streaming  permutation  structure  (n  points,  w  words  per  cycle)  built 
using  the  general  permutation  method. 

be  constructed  in  several  ways;  the  simplest  is  to  build  the  network  for  the  next  largest  power  of 
two.  However,  any  other  appropriate  interconnection  network  or  crossbar  may  be  substituted.  This 
decision  does  not  affect  the  other  portions  of  the  datapath  (Mo,  Mi,  R,  and  W). 

As  data  flows  through  this  structure,  it  passes  through  five  stages: 

1.  Write  into  Mo.  Data  flows  into  the  system  and  is  written  in  order  to  the  banks  of  Mq. 
Element  i  is  written  into  address  [i/w\  in  bank  i  (mod  w).  Address  values  are  generated  with  a 
|’log2(ra/tn)l  bit  counter. 

2.  Read  from  Mo-  On  each  cycle  of  this  phase,  one  word  is  read  from  each  of  the  w  memory 
banks  of  Mq.  The  read  addresses  ([log 2(n/w)~\  bits  each)  are  pre-computed  and  stored  in  lookup 
table  R.  Each  line  of  R  holds  the  w  memory  read  addresses  for  a  given  cycle.  Thus,  R  contains 
n/w  lines,  and  each  line  requires  w  ■  [log2(n/iu)~|  bits. 

3.  Connection  network.  The  connection  network  N  takes  in  w  elements  and  outputs  them  in 
a  permuted  order.  The  design  tool  must  pre-compute  the  values  that  will  control  the  network,  and 
store  them  in  T,  which  contains  n/w  configurations,  with  w'  log2  w'  —  w'  +  1  bits  per  configuration 
(where  w'  =  2  ^2  “l). 

4.  Write  to  Mi.  On  each  cycle  of  this  phase,  one  word  is  written  into  each  of  the  w  memory 
banks  of  M\.  The  write  addresses  are  pre-computed  and  stored  in  lookup  table  W,  each  line  of 
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which  holds  w  write  addresses.  So,  W  contains  n/w  lines,  each  of  width  w  ■  |~log2(ro/u;)]  bits. 

5.  Read  from  M\.  Data  are  read  from  Mi  in  order  and  flow  out  of  the  system.  Element  i  is 


read  from  address  [i/w J  of  bank  i  (mod  w).  Address  values  are  generated  with  a  \\og2{n/w)~\  bit 


counter. 


In  order  to  maintain  full  throughput  across  multiple  problems,  Mo  and  M\  are  each  sized  to 
hold  a  total  of  2 n  words.  Then,  n  words  can  be  written  into  addresses  (0, . . . ,  n/w  —  1)  of  each 


bank  while  the  previous  n  words  are  being  read  from  addresses  (n/w, ....  2n/w  —  1).  The  design 


accomplishes  this  by  using  a  one  bit  register  to  determine  whether  or  not  to  add  an  offset  to  the 
addresses  flowing  into  the  RAMs.  Every  n/w  cycles,  this  bit  is  complemented. 

The  effect  of  this  is  that  up  to  three  n  point  data  vectors  can  be  active  in  the  structure  at  one 
time.  One  set  of  n  data  points  can  be  flowing  into  Mo  (step  1),  while  a  second  set  flows  between 
Mo  and  M\  (steps  2-4),  while  a  third  set  of  n  points  flows  out  of  M\ . 

An  important  aspect  to  understand  about  this  datapath  is  that  all  of  the  reordering  is  done  in 
stages  2-4.  Stage  1  writes  data  to  Mo  in  natural  order,  and  Stage  5  reads  data  from  Mi  assuming 
it  is  already  in  permuted  order.  The  problem  then  becomes:  given  a  permutation  Pn,  how  does  one 
guarantee  that,  on  each  cycle,  one  can  read  w  words  from  Mo  and  write  them  to  the  correct  locations 
of  Mi  without  conflicts  (i.e.,  needing  to  read/write  multiple  words  from/to  the  same  RAM  at  the 
same  time)?  A  solution  to  this  problem  implies  a  datapath  configuration  that  will  be  capable  of 
performing  Pn  with  full  throughput  (w  words  per  cycle)  without  stalling.  Section  5.3.2  formalizes 
this  problem  and  derives  a  solution. 

Extension  to  support  multiple  permutations.  This  method  can  also  be  used  if  multiple  per¬ 
mutations  are  to  be  performed  by  one  iterative  reuse  structure.  For  example,  the  formula 


k- 1. 


performs  a  different  permutation  in  each  of  the  k  iterations.  So,  this  can  be  implemented  by  increas¬ 
ing  the  size  of  the  pre-computed  ROMs  by  a  factor  of  k,  and  adding  an  extra  input  for  the  iteration 
variable  £  to  determine  which  of  the  k  configuration  sets  to  use. 


Tensor  product  of  permutations.  Often,  algorithms  permute  n  points  by  performing  a  smaller 
k  point  permutation  over  n/k  consecutive  regions  of  the  vector  (k  <  n,  and  k  and  n/k  are  integers). 


55 


Chapter  5.  Permuting  Streaming  Data 


As  a  formula,  this  is  written  In/k  ®  Pn.  When  streamed  with  width  w  where  w  <  k,  this  becomes 


4 Ik  4/fc  ®sr  Pk  ->  4/fc  ®sr  StreamPerm(Pfe,  w) 

\ 


Although  this  is  still  a  permutation  on  n  total  points,  the  implementation  cost  is  equal  to  a  k  point 
streaming  permutation. 

These  structures  occur  frequently  in  transform  algorithms.  For  example,  each  stage  (parameter¬ 
ized  by  £)  of  the  Iterative  FFT  (2.15)  has  permutation: 


Ire  ®  Prt-e 


In  iteration  £,  this  performs  rf  permutations  on  rl  1  points.  Thus,  when  this  formula  is  streamed, 
the  stages  corresponding  to  larger  values  of  £  are  less  expensive  to  implement. 

5.3.2  Problem  Formulation 

This  section  formulates  a  mathematical  problem  that  corresponds  to  mapping  a  streaming  permuta¬ 
tion  to  the  datapath  presented  in  Section  5.3.1.  Further,  it  demonstrates  that  this  problem  is  solvable 
for  all  permutations  and  streaming  widths. 

As  discussed  in  the  previous  section,  the  ordering  of  the  data  inside  both  memory  arrays  is  fixed: 
in  Mo,  it  must  be  in  natural  order  and  in  Mi,  it  must  be  in  permuted  order.  The  key  problem  then  is 
to  choose  w  words  each  cycle  that  are  read  from  the  w  different  ports  of  Mo  that  must  be  written  to 
the  w  different  ports  of  M\.  Formally,  the  problem  is  defined  as  follows: 

Problem  1.  Given  are  a  permutation  Pn  on  I  =  {0, . . .  ,n—  1}  and  a  streaming  width  w.  For  each 
of  the  n/w  time  steps  j,  j  =  0, . . . ,  n/w  —  1,  find  a  subset  Sj  C  /  containing  precisely  w  points 
such  that 

1.  the  union  of  all  Sj  is  I  (which  implies  that  the  Sj  are  pairwise  disjoint);  and 

2.  for  every  Sj  and  every  k,£  £  Sj  where  k  £: 


k  £  (mod  w)  and  Pn(k)  Pn{£)  (mod  w). 


A  solution  to  Problem  1  will  allow  us  to  read  (in  cycle  j)  w  elements  (specified  by  Sj)  from  Mq 
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and  write  them  to  M\.  By  construction,  the  w  words  will  be  read  from  w  distinct  memory  banks 
in  Mo  and  written  to  w  distinct  memory  hanks  in  M\.  Since  there  are  no  conflicts  at  the  read/write 
ports,  the  system  can  sustain  a  full  throughput  of  w  words  per  cycle  as  desired.  Thus,  a  solution  to 
this  problem  implies  values  for  R,  W,  and  T  that  will  allow  the  datapath  in  Figure  5.3  to  perform 
Pn- 

Next,  we  transform  Problem  1  into  a  form  that  enables  its  solution.  We  start  by  defining  a 
mapping  itw  that  collects  the  set  of  connections  needed  between  the  output  of  Mq  and  the  input  of 
Mi  in  a  matrix. 

Definition  2.  The  mapping  irw  takes  as  input  annxn  permutation  matrix  Pn  and  outputs  awxw 
matrix  of  integers.  Ifitw(Pn)  =  [ck,e  \  k,  t  =  0, . . . ,  w  —  1],  then 

Ck,i  =  |{x  G  I  |  x(mod  w)  =  i  and  Pn(x)(mod  w)  =  k}\. 

In  words,  the  (k,  t)  element  of  itw  ( Pn )  gives  the  number  of  data  words  to  be  read  from  port  l  of 
Mo  and  written  to  port  k  of  Mi . 

For  example,  consider  the  permutation 

Pi2  =  (0, . . . ,  11)  -f  (3,  7, 1,  2,  6,  0, 11,  9, 4, 10, 8,  5).  (5.1) 


This  permutation  corresponds  to  the  example  seen  in  Figure  5.1.  If  w  =  3  ports,  then 


7T3(-Pl2) 


( l  2  Is 
1  1  2 

\2  1  V 


(5.2) 


Lemma  3.  The  sum  of  all  elements  in  a  given  row  or  column  of  it w{Pn)  is  n/w.  A  matrix  with  this 
property  is  called  a  semi-magic  square. 


Proof.  This  follows  from  Definition  2  and  the  assumption  that  one  word  streams  into  each  input 
(and  out  of  each  output)  on  each  of  n/w  cycles.  □ 


Lemma  4.  nw(Pn )  can  be  decomposed  into  a  sum  of  n/w  many  wxw  permutation  matrices,  some 
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of  which  may  be  repeated. 

Proof.  This  follows  from  [28],  which  proves  that  any  semi-magic  square  can  be  decomposed  into  a 
sum  of  permutation  matrices.  The  total  number  must  be  n/w  due  to  Lemma  3.  □ 

We  represent  this  decomposition  as 

ttw{.Pn)  =  PoQo  +  PlQl  +  '  '  ‘  +  (3k-lQk-h  (5-3) 

where  the  Qi  are  permutation  matrices  and  the  f3t  are  integer  constants  >  1  with  /Ti  ^  =  n/w. 

Each  of  the  Qi  permutations  represents  a  particular  configuration  of  the  switching  network  N 
in  Figure  5.3.  So,  one  can  choose  w  corresponding  words  from  Mq  (one  from  each  bank),  permute 
them  by  Qi,  and  write  them  into  the  correct  locations  of  Mi  (one  word  to  each  bank).  Thus,  the 
decomposition  in  (5.3)  represents  a  solution  to  Problem  1 . 

Continuing  the  example  from  (5.1)  and  (5.2),  we  can  decompose  tt,,, ( P\ 2 )  as 


( \  0  0^ 

^0  1  0^ 

t-H 

O 

O 

Kw(Pl2)  = 

0  1  0 

+  2 

0  0  1 

+ 

1  0  0 

0 

0 

h-> ■ 

O 

O 

1°  1  V 

Lemma  5.  trw(Pn)  has  a  decomposition  (5.3)  that  satisfies 

k  <  min(u;2  —  2 w  +  2,  n/w). 

Proof.  [29]  gives  an  upper  bound  of  w2  —  2w  +  2  permutations.  We  can  further  tighten  this  bound 
to  min(u;2  -  2w  +  2,  n/w),  since  Lemma  4  shows  that  there  are  at  most  n/w  permutations.  □ 

Lemma  5  shows  that  even  though  there  are  w\  many  w  x  w  permutation  matrices,  a  much 
smaller  number  is  needed  in  (5.3).  For  the  proposed  datapath  (Figure  5.3)  this  could  lead  to  a 
reduced  storage  requirement  for  T  since  some  of  the  k  settings  may  be  used  multiple  times  (if 
k  <  n/w).  Taking  advantage  of  this  would  require  storing  the  values  of  /3i  and  adding  additional 
logic  to  determine  when  to  increment  T’s  address  value.  This  optimization  would  reduce  the  number 
of  elements  in  T  from  n/w  to  k,  and  hence  may  reduce  the  overall  hardware  cost  for  some  problems 
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(if  the  savings  from  storing  fewer  words  offsets  the  added  cost  of  storing  the  /3j  and  the  added  logic). 
This  optimization  is  not  explored  here. 

5.3.3  Algorithm 

Based  upon  the  problem  specified  in  Section  5.3.2,  we  formulate  an  algorithm  that  calculates  the 
parameters  needed  to  perform  a  given  permutation  Pn  with  a  streaming  width  w  on  the  datapath  in 
Figure  5.3.  This  algorithm  is  executed  at  design  time;  it  determines  the  control  values  to  be  stored 
into  ROMs  R,  T,  and  W. 

This  algorithm  computes  a  decomposition  of  irw(Pn)  of  the  form  (5.3)  and  calculates  the  corre¬ 
sponding  values  to  store  in  R ,  T,  and  W.  Recall,  It  and  W  denote  the  collection  of  addresses  for  Mo 
and  Mi,  respectively.  L  =  (Qo- . . . ,  Qn/W~  i)  represents  the  list  of  permutations  that  the  connection 
network  must  perform,  and  T  represents  the  configuration  bits  associated  with  each  permutation  in 
L  (computed  using  the  methods  in  [26,  27]). 

Algorithm  6.  Input:  Pn  and  w.  Output:  R,  T,  and  W. 

1.  C  <—  nw(Pn)-,  (as  described  in  Definition  2). 

2.  while  {C  contains  non-zero  entries)  do 

•  find  a  permutation  Q  included  in  C ; 

•  while  C  —  Q  contains  no  negative  entries  do 

-  C<-C-Q-, 

-  append  permutation  Q  to  L; 

-  find  (xo, . . . ,  xw-i)  s.t.  (x,:(mod  w)  =  i)  and  (Pn(xj)(mod  w)  =  Q{i))\ 

-  append  {[xi/w\),  0  <  i  <  w  to  72; 

-  append  ([Pn(xQ-i^)/w\),  0  <  *  <  w  to  W; 

3.  calculate  T  based  on  the  permutations  stored  in  L,  using  the  techniques  in  [26,  27]; 

4.  output  R,  T,  and  W ; 

The  most  computationally  difficult  part  of  Algorithm  6  is  finding  a  permutation  Qw  that  may 
be  subtracted  from  Cw.  This  can  be  accomplished  in  three  ways:  using  a  brute  force  algorithm, 
mapping  the  problem  to  a  satisfiability  problem,  or  using  an  algorithm  based  on  systems  of  distinct 
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name 

type 

ports 

#  needed 

#  words 

bits  per  word 

M0 

RAM 

2 

w 

2  n/w 

b 

Mi 

RAM 

2 

w 

2n/w 

b 

R 

ROM 

1 

1 

n 

w  •  \log2(n/w)] 

W 

ROM 

1 

1 

n 

w  •  \log2(n/w)] 

T 

ROM 

1 

1 

n/w 

w'  log2  w'  —  w'  +  1 

Table  5.1:  Summary  of  memories  required  for  streaming  permutation  structures. 

representatives  (e.g.,  [30,  Ch.  5]).  This  implementation  uses  the  satisfiability  approach;  it  is  able  to 
solve  practical  problem  sizes  quickly.  For  the  largest  problem  in  this  evaluation  (n  =  4096,  w  = 
64),  Algorithm  6  completes  in  approximately  6  minutes. 

5.4  Evaluation 

This  section  discusses  the  implementation  of  the  proposed  general  streaming  permutation  method 
and  evaluates  them  by  comparing  with  the  results  of  [20].  A  tool  based  upon  the  techniques  dis¬ 
cussed  in  this  section  is  implemented  as  an  automated  generation  tool  (built  as  a  module  within  the 
main  compiler  described  in  Chapter  4).  This  tool  takes  as  input  a  permutation  Pn  and  a  streaming 
width  w.  and  outputs  a  register-transfer  level  Verilog  description  of  the  design.  The  tool  generates 
an  instance  of  the  parameterized  datapath,  and  uses  Algorithm  6  to  determine  the  values  to  store  in 
the  lookup  tables.  Additionally,  the  bit-permutation-based  technique  from  [20]  is  also  implemented 
as  another  module  within  the  compilation  tool. 

Analysis  of  Generated  Designs.  Table  5.1  summarizes  the  number  and  size  of  memories 
needed  by  the  proposed  solution.  Additionally,  the  datapath  requires  w'  log2  w'  —  w'  +  1  two- 
input  switches,  where  w'  =  2  rioS2  H  .  jn  general,  the  cost  of  this  implementation  depends  only  on  n 
(the  number  of  points  in  the  data  vector)  and  w  (the  streaming  width  of  the  system).2 

Designs  produced  using  this  method  have  a  throughput  of  w  words  per  cycle,  and  a  latency  of 
2 n/w  +  [log2(ru)l  +  3  cycles. 

Experimental  Evaluation.  This  section  evaluates  the  cost  and  performance  of  the  proposed 
generated  designs  when  synthesized  for  a  Xilinx  Virtex-5  FPGA.  We  use  Xilinx  ISE  9.2  to  synthe- 

2The  cost  of  implementation  can  only  be  further  reduced  in  rare  cases  where  all  or  part  of  a  lookup  table  or  connection 
network  is  redundant. 
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throughput  [gigabits  per  second] 

300 

— n=64 
^^n  =  512 


Figure  5.4:  Throughput  versus  slices  for  n  =  64,  512, 4096.  Labels:  number  of  BRAMs. 

size  and  place/route  all  designs,  and  we  extract  all  timing  and  area  measurements  after  place/route 
has  completed.  In  this  evaluation,  the  data  words  arc  b  =  16  bits  wide. 

Virtex-5  FPGAs  contain  on-chip  memory  structures  called  block  RAM  (BRAM).  When  the 
design  requires  a  RAM  or  ROM  of  size  >  1024  bits,  the  system  maps  it  to  a  BRAM.  Smaller 
memories  are  created  out  of  the  FPGA’s  reconfigurable  logic  elements.  This  threshold  (1024  bits) 
is  a  parameter  of  the  generation  tool. 

This  evaluation  quantifies  the  cost  and  performance  of  the  following  configurations  of  this  dat¬ 
apath:  n  =  64  with  w  =  2, 4,  8, 16, 32,  and  n  =  512, 4096  with  w  =  2, 4,  8, 16, 32,  64.  Because 
the  cost  of  the  implementation  does  not  depend  on  the  specific  permutation  being  performed,  we 
choose  one  randomly  for  each  n.  Figure  5.4  shows  throughput  (in  gigabits  per  second)  versus  FPGA 
area  (in  slices,  which  arc  the  reconfigurable  blocks  of  the  FPGA)  for  all  designs.  Each  line  shows  a 
different  value  of  n  (the  problem  size),  and  the  different  points  within  a  line  correspond  to  different 
values  of  streaming  width  w:  the  left-most  point  corresponds  to  w  =  2;  w  doubles  with  each  suc¬ 
cessive  point.  (Recall,  the  throughput  of  each  design  is  w  words  per  cycle.)  Additionally,  we  label 
several  of  the  points  with  the  number  of  block  RAMs  (BRAMs)  used  by  that  design. 

Figure  5.4  shows  that  as  w  increases,  the  resulting  designs  exhibit  higher  throughput,  but  be- 
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come  commensurately  more  expensive  (in  area).  As  n  increases,  the  results  do  not  show  a  significant 
change  in  throughput  or  area,  but  a  large  increase  in  the  number  of  BRAMs  required.  This  occurs 
because  the  size  of  all  memories  grows  with  n  (as  shown  in  Table  5.1). 

In  order  to  provide  a  reference  point  for  comparison,  these  designs  are  compared  to  [20],  which 
describes  a  method  for  generating  streaming  permutation  circuits  for  a  subset  of  all  permutations 
and  streaming  widths.3  The  designs  produced  by  [20]  utilize  one  memory  array  with  interconnec¬ 
tion  networks  at  its  inputs  and  outputs;  both  networks  are  optimized  for  the  specific  permutation 
considered.  Furthermore,  all  memory  and  switch  configurations  are  calculated  online;  no  lookup  ta¬ 
bles  are  used.  For  an  important  class  of  problems,  [20]  is  able  to  produce  designs  with  the  optimum 
address  logic  and  switching  network  (given  the  assumed  architecture). 

However,  the  technique  in  [20]  is  only  applicable  to  a  small  subset  of  streaming  permutations. 
The  goal  is  not  to  improve  on  [20] ’s  cost/performance  tradeoff;  it  instead  serves  as  a  way  to  measure 
the  added  costs  incurred  by  moving  to  the  general  structure. 

The  area  required  by  the  designs  in  [20]  depends  on  the  permutation  being  performed  (as  well  as 
the  permutation  size  and  streaming  width).  So,  we  compare  against  designs  for  two  permutations: 
the  stride-by-two  permutation,  which  is  in  the  class  of  least  expensive  problems  supported  by  [20], 
and  the  bit  reversal  permutation,  which  is  in  the  class  of  most  expensive  problems.  Again,  we  evalu¬ 
ate  designs  using  a  random  permutation,  since  the  cost  does  not  depend  on  the  specific  permutation 
being  performed. 

We  synthesize  and  place/route  these  benchmark  designs  using  the  previously  stated  assump¬ 
tions,  and  present  the  results  in  Figure  5.5  for  n  =  512.  Again,  the  plot  shows  throughput  versus 
area  and  includes  the  number  of  BRAMs  for  several  designs.  For  small  values  of  w,  all  three  de¬ 
signs  have  similar  costs.  However,  as  w  increases,  the  amount  of  slices  and  BRAM  required  for 
the  general  method  increases  more  quickly  than  those  from  [20].  If  we  repeat  this  experiment  for 
smaller  problem  sizes  (values  of  n),  the  difference  between  the  proposed  method  and  the  benchmark 
is  reduced;  for  larger  values  of  n,  it  is  increased. 

The  Spiral  hardware  generation  framework  can  implement  both  the  general  streaming  permu¬ 
tation  method  [19]  and  the  bit-matrix  method  [20].  Since  the  bit- matrix  method  results  in  lower 

3  [20]  is  only  able  to  perform  permutations  that  arise  from  invertible  mappings  on  the  bit  representations  of  the  indices, 
such  as  the  bit  reversal  or  stride  permutations.  Of  the  n!  possible  permutations  on  n  points,  [20]  is  able  to  perform 
(2m  -  l)(2m  -  2)  •  •  •  (2m  -  2m_1),  where  n  =  2m  and  only  when  n  and  w  are  powers  of  two. 
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Streaming  Permutations,  n=512  on  Xilinx  Virtex-5  FPGA 

throughput  [billion  samples  per  second] 

25 

“♦-General  method 
“^Stride  permutation 


Streaming  Permutations,  n  =  4096  on  Xilinx  Virtex-5  FPGA 

throughput  [billion  samples  per  second] 

25 

^^General  method 
^^Stride  permutation 


Figure  5.5:  Comparison  of  general  streaming  permutation  method  with  bit  matrix  method  for  n  = 
512  and  n  =  4096.  Data  labels  indicate  the  number  of  BRAMs. 


cost,  it  is  used  when:  (1)  the  transform  size  and  streaming  width  are  powers  of  two,  (2)  the  per¬ 
mutation  can  be  represented  as  a  bit  matrix,  and  (3)  only  one  permutation  is  to  be  computed  with 
the  same  structure.  If  these  conditions  are  not  met,  the  general  streaming  method  described  in  this 
chapter  [19]  is  used. 

Although  it  is  not  discussed  in  this  chapter,  the  compiler  also  can  implement  permutations  with 
streaming  width  w  =  1  trivially  using  a  single  memory  (one  read  port  and  one  write  port).  Data  are 
written  into  the  memory  sequentially,  and  a  ROM  is  used  to  encode  the  read  addresses. 


Chapter  6 


Transform  Algorithms 


This  chapter  discusses  the  transform  algorithms  presented  in  Section  2.3  in  the  context  of  streaming 
reuse  and  iterative  reuse,  examining  the  applicability  and  limitations  of  each.  Chapter  7  will  later 
evaluate  implementations  of  these  algorithms  on  FPGA  and  ASIC. 

First,  Section  6.1  discusses  hardware  structures  for  algorithms  for  the  discrete  Fourier  trans¬ 
form,  including  those  that  are  applicable  for  two-power  and  non-two-power  problem  sizes.  Then, 
Section  6.2  explains  algorithms  for  the  two-dimensional  DFT  and  how  they  can  be  mapped  to  hard¬ 
ware  with  more  flexibility  than  the  one-dimensional  version.  Next,  Section  6.3  and  Section  6.4 
discuss  hardware  implementation  of  algorithms  for  the  RDFT  and  DCT  algorithms,  respectively. 


6.1  Discrete  Fourier  Transform 


Pease  FFT.  The  Pease  FFT  algorithm  with  radix  r  is  given  by 


where  C  is  a  diagonal  matrix,  and  L  and  R  are  permutation  matrices.  All  of  its  terms  can  be 
used  with  streaming  reuse:  the  permutations  and  diagonals  as  described  previously,  and  the  tensor 
product  Irt- 1  0  DFTr  can  be  restructured  to  any  width  w  >  r. 

In  general,  larger  values  of  radix  r  reduce  the  computational  cost  (by  reducing  the  number 
of  multiplications  performed).  However,  this  comes  at  the  cost  of  reduced  flexibility,  since  the 
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streaming  width  w  must  be  >  r.  The  evaluations  of  Chapter  7  illustrate  the  effects  of  varying  the 
radix  of  this  and  other  algorithms. 

This  algorithm  is  a  good  candidate  for  iterative  reuse,  since  it  only  depends  on  the  product 
term’s  index  i  in  the  parameter  to  the  diagonal  matrix  C,  which  can  be  implemented  as  described  in 
Chapter  3.2. 1 .  It  can  be  iteratively  reused  with  depth  d  where  d  \  t  (that  is,  d  evenly  divides  t). 

So,  this  algorithm  can  be  tagged  and  restructured  automatically  by  Spiral  to  have  streaming 
width  w  and  depth  d  in  the  following  manner: 


DFTrt 

stream(iu),  depth(d) 


/t/d-l  /d- 1 

inn  StreamPerm(L£  ,  w)  ■  ( Irt jw  <g)sr  (Iw/r  <8>  DFTr)) 
v k= o  \e=o 


■  StreamDiag(C^fi+^,  w)  •  StreamPerm(i?^  ,w). 


Both  permutations  used  here  (stride  permutation  L  and  digit-reversal  permutation  R)  are  imple¬ 
mented  using  the  bit  matrix  method  [20]  when  r  is  a  power  of  two,  as  explained  in  Chapter  5.2. 

For  example,  consider  DFT23 .  First,  the  radix  2  Pease  FFT  gives  the  formula 

[1[l82-(i4®-dft2)-c! \  -i?| 

V=o  / 

which  is  illustrated  as  hardware  with  no  sequential  reuse  or  iterative  reuse  in  Figure  6.1.  The  8  point 
input  vector  x  enters  on  the  left  and  flows  through  the  datapath,  producing  output  vector  y  on  the 
right.  Below  the  figure,  each  portion  of  the  datapath  is  annotated  with  the  matrix  that  describes  its 
computation.  The  shaded  boxes  illustrate  the  three  stages  of  the  product  term. 

Next,  we  can  apply  streaming  reuse  to  this  formula.  For  example,  at  width  w  =  2: 


DFT23  -> 
stream(2) 

(  ]^[  StreamPerm(L2,  2)  ■  (I4  <g)sr  DFT2)  •  StreamDiag(C®,  2)  j  •  StreamPerm(i?2,  2). 

v=o  J 

This  is  illustrated  in  Figure  6.2.  As  before,  the  three  stages  are  enclosed  in  shaded  blocks.  Now,  the 
permutations  are  drawn  as  blocks  with  two  inputs  and  two  outputs,  reflecting  their  streaming  width 
of  w  =  2.  The  multipliers  used  for  the  diagonal  matrices  now  have  lookup  tables  attached,  because 
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x - ►  y 


• — • — • —  — • — • — ^  \—» — • — ^  \—» — • 

dft2  dft2  dft2 


r82  Cf  h  ®  dft2  L\  Cl  h  ®  dft2  L\  Cl  h  ®  dft2  l\ 

Figure  6.1:  Pease  FFT:  DFT23,  no  sequential  reuse. 


x - ►  y 


Figure  6.2:  Pease  FFT:  DFT23,  streaming  width  w  =  2. 


different  constants  are  needed  for  different  portions  of  the  stream.  The  input  data  x  flows  in  at  a 
rate  of  w  =  2  words  per  cycle,  then  flows  through  input  permutation  f?2  before  flowing  through  the 
three  stages  and  exiting  as  y.  A  new  input  vector  can  begin  flowing  into  the  system  as  soon  as  the 
previous  one  has  finished  entering. 

Next,  we  can  apply  iterative  reuse  with  depth  d  =  1. 


DFT2s 

stream(2),  depth(l) 


StreamPerm(L2,  2)  •  (. 1 4  (g)sr  DFT2)  •  StreamDiag(C®,  2) 


^=0 


•  StreamPerm(f?2,  2). 


Figure  6.3  shows  the  resulting  datapath.  Now,  the  three  stages  are  iteratively  reused,  collapsed 
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x - ►  y 


DFT2  Perm 

Perm 

V 

- 1 

Lookup 


Figure  6.3:  Pease  FFT:  DFT23,  streaming  width  w  =  2,  depth  d  =  1. 

into  one  stage  with  a  feedback  path.  The  multiplier’s  lookup  table  is  increased  to  hold  the  constants 
needed  for  all  three  iterations.  The  streaming  permutations  and  basic  blocks  are  identical  in  all  three 
stages,  so  they  can  be  iteratively  reused  with  no  added  cost.  As  before,  input  vector  x  flows  into  the 
system  at  a  rate  of  two  words  per  cycle.  However,  now  a  new  input  vector  cannot  begin  streaming 
in  immediately  after  the  previous  one;  instead  it  must  wait  until  the  first  vector  has  iterated  through. 

Iterative  FFT  Algorithm.  The  variant  of  the  Iterative  Cooley-Tukey  algorithm  considered  in 
this  work  is  given  by 

DFTrt  =  Lr*  ( ®  DFTr)  (lri  <g>  (e^  ■  Qi^))]  Rr\ 

V=o  / 

for  radix  r.  It  contains  permutation  matrices  L,  Q,  and  R  (all  of  which  are  implemented  using  the 
bit  matrix  method  when  r  is  a  power  of  two),  and  diagonal  matrix  E.  Similar  to  the  Pease  algorithm 
above,  this  algorithm  can  easily  be  streamed  with  width  w  >  r. 

However,  this  algorithm  differs  from  the  Pease  FFT  in  an  important  way.  Here,  the  permutation 
and  diagonal  matrices  grow  smaller  and  more  repetitive  as  l  grows  huger.  When  iterative  reuse  is 
not  used,  this  makes  the  Iterative  Cooley-Tukey  FFT  less  expensive  to  implement  than  the  Pease 
FFT,  because  the  cost  of  streaming  permutation  implementation  is  related  to  the  number  of  points 
being  permuted  (r*  in  the  Pease  algorithm  and  rl~l  in  this  algorithm). 

So,  the  Pease  FFT  is  less  expensive  in  iterative  reuse  implementations,  and  the  Iterative  Cooley- 
Tukey  FFT  algorithm  is  cheaper  in  implementations  without  iterative  reuse. 

Mixed-Radix  Algorithm.  The  Mixed-Radix  FFT  algorithm  of  radices  r  and  s  is  given  by 


DFTrts<  = 


*  ■  (Ise  ®BFTrk)  ■  Lrsk/ -TrJs* -(Irk 


»DFT  st)-Lrrk/. 
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This  algorithm  is  typically  used  to  compute  a  transform  at  a  non-native  radix  size  (for  example, 
DFT512  can  be  decomposed  into  DFT2,  which  can  be  implemented  directly,  and  DFT256  which 
can  be  implemented  with  radix  4  or  16).  Additionally,  this  algorithm  is  frequently  used  in  non-two- 
power-sized  problems  if  the  problem  size  is  a  product  of  small  primes  (for  example  DFT432  can  be 
broken  down  into  DFT24  and  DFT33). 

Since  this  algorithm  is  comprised  only  of  permutations  (L),  a  diagonal  matrix  (T),  and  parallel 
DFTs,  streaming  reuse  can  easily  be  applied  with  the  restriction  that  the  streaming  width  w  must 
evenly  divide  rkse  and  it  must  be  a  multiple  of  rs.  For  this  reason,  implementations  of  this  algorithm 
quickly  lose  flexibility  as  the  radix  sizes  increase. 

The  individual  DFTr*  and  DFTS<  blocks  can  be  computed  with  either  the  Pease  or  Iterative 
FFT  algorithms,  depending  on  whether  or  not  iterative  reuse  is  desired  (as  described  above). 

In  principle  it  is  possible  to  use  iterative  reuse  to  collapse  the  two  DFT  stages  into  one  combined 
stage.  However,  the  two  blocks  (DFTrfc  and  I)FT2<)  are  so  different  that  the  overhead  of  doing  so 
is  high. 

Although  the  algorithm  still  holds  when  r  =  s,  there  is  no  benefit  of  using  this  algorithm  over 
the  radix  r  algorithms  considered  above  (Pease  and  Iterative  FFTs). 

Bluestein  Algorithm.  The  Bluestein  algorithm  performs  a  DFT  of  any  given  problem  size  n  by 
performing  scaling  and  a  forward  and  inverse  DFT  of  size  m  where  m  >  2n  — -  1 . 

DFT  —  D*'2'*  ■  TDFT  .nW-DFT  • D ® 

Typically,  this  algorithm  is  used  to  perform  non-power-of-two  size  DFTs,  where  m  is  chosen  to  be  a 
power  of  two.  This  allows  the  bulk  of  the  computation  to  be  performed  using  the  Pease  and  Iterative 
Cooley-Tukey  algorithms. 

Further,  the  forward  and  inverse  DFTn  are  closely  related  in  structure,  so  the  two  stages  of  the 
algorithm  can  be  iteratively  reused  in  the  following  way.  First,  the  formula  is  regrouped: 

DFTn  =  Inxm  ( I]  ■  DFT^) 

v=o 
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where 


m  = 


-1,  £  =  0 

1.  1=1 


and  g(£)  =  < 


2,  £  =  0 

1,  1=1 


So,  this  product  term  can  now  be  iteratively  reused  with  depth  1  if  desired.  Then,  the  inner  DFT 
block  can  be  iteratively  reuse  (using  the  Pease  algorithm)  or  not  (using  the  iterative  Cooley-Tukey 
algorithm). 


6.2  Two-Dimensional  Discrete  Fourier  Transform 


As  shown  in  (2.18),  the  row-column  algorithm  for  the  two-dimensional  DFT  is  given  by 


DFT  -2Dnxn  l[  ((Jn  ®  DFTn)  •  l£ )  . 

£=0 

One  can  use  the  Pease  FFT  (2.13),  the  Iterative  Cooley-Tukey  FFT  (2.15)  or  the  Mixed-Radix 
FFT  (2.16)  to  perform  the  DFTn.  Using  Pease  (with  radix  r): 


DFT 


nxn 


a- i 


[  4®  n  (L?-(In/r®DFTr)-C2*)  -i?  L 


k= o  \  \ 
t  /  ft- 1 


e=o 


I  n«^  ®  L r)  •  ®  DFT-)  •  0  C"))  ■  Vn  ®  i?”)  •  C 


2 

7T  I  5 


/c=0  \  \£=0 


where  f  =  logr(n).  In  order  to  facilitate  discussion  of  this  algorithm,  one  can  write  it  in  a  simplified 
form  while  still  capturing  its  relevant  degrees  of  freedom: 


DFTnxn  =  ( JJ  |  JJ  Pn2{In2/r  ®  DFTr)Sn2  j  Qn2  j  , 

\fc=0  \i=o  )  ) 

where  P  and  Q  are  placeholders  for  the  permutation  expressions  written  above,  and  B  is  a  place¬ 
holder  for  its  corresponding  diagonal  matrix.  Using  the  other  algorithms  (2.15)  and  (2.16)  similar 
expressions  are  obtained. 

These  algorithms  contain  two  iterative  product  terms,  as  seen  above.  Each  may  utilize  iterative 
reuse,  which  is  characterized  with  depth  parameters  d\  and  4  (see  Chapter  3.2).  This  is  illustrated 
in  Figure  6.4. 
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(a)  d\  =  1.  The  outer  product  term  is  iteratively  reused. 


d2  stages 


(b)  d±  =  2.  The  outer  product  term  is  fully  unrolled,  i.e.,  not  iteratively  reused. 

Figure  6.4:  Illustrations  of  DFTnxn  with  outer  product  term  parameterized  by  d\,  inner  product 
term  parameterized  by  d2,  and  streaming  width  w. 

Let  d2  be  the  depth  of  the  inner  product;  d2  can  be  between  1  and  t,  provided  that  t/d2  is  an 
integer.  Each  internal  block  (a  shaded  region  in  Figure  6.4)  consists  of  d2  stages  of  P- (I®DFTr)-D, 
streamed  with  w  ports.  When  d2  =  t,  the  Iterative  Cooley-Tukey  FFT  (2.15)  is  used  to  decompose 
DFTn  (or  the  mixed-radix  algorithm  if  the  problem  size  is  not  a  power  of  the  chosen  radix);  when 
d2  <  t,  the  Pease  FFT  (2.13)  is  used.  Alternatively,  the  Mixed  Radix  FFT  can  be  used  to  decompose 
DFTn  as  well. 

Permutation  Q  can  be  implemented  with  the  bit  matrix  method  when  n  is  a  power  of  two. 

Fet  d\  be  the  depth  of  the  outer  product;  d\  can  be  either  1  or  2.  When  d\  =  1,  the  outer 
product  term  is  iteratively  reused  (Figure  6.4(a)).  When  d\  =  2,  the  outer  term  is  unrolled,  giving 
two  cascaded  stages  as  seen  in  Figure  6.4(b). 
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6.3  Real  Discrete  Fourier  Transform 

RDFT  from  Complex  DFT.  An  RDFT  of  size  n  can  be  computed  from  a  DFT  of  size  n/2  using 
the  following  algorithm 


((K'2n  ®  h)  ■  (ln/4  ®,  A4(£))  ■  ((K?)-1  ®  4))  •  DFTn/2 


RDFTn 


where  K'  is  a  permutation  (thus  (K'  ®  I)  is  also  a  permutation),  A4{£)  is  a  4  x  4  basic  block  that 
depends  on  iteration  variable  l,  and  DFT„/2  represents  an  re  x  n  real  matrix  that  is  constructed  from 
the  re/2  x  re/2  complex  matrix  DFT„/2  as  explained  in  Chapter  2.3.3.  While  computing,  changing 
the  data  vector  from  are/2  point  complex  into  a  re  point  real  vector  is  simply  a  reinterpretation:  no 
logic  is  needed. 

Any  of  the  DFT  algorithms  discussed  above  may  be  used  to  compute  DFTn/2.  As  described 
previously,  the  algorithms  can  be  transposed,  which  moves  the  bit  reversal  permutation  R  to  the 
left  side  of  the  formula,  where  it  can  be  merged  with  the  (Ff2n)_1  0  J2  permutation.  K'  can  not  be 
represented  as  a  bit  matrix,  so  the  general  streaming  method  is  needed. 

This  algorithm  can  be  streamed  with  (real)  streaming  width  w  >  2r  where  r  is  the  (complex) 
radix  of  the  implementation  of  DFTn/2  chosen.  Iterative  reuse  can  be  applied  to  DFTn/2  as  de¬ 
scribed  above. 

This  algorithm  has  higher  arithmetic  cost  than  the  native  RDFT  algorithms  below,  but  in  some 
cases,  it  maps  to  hardware  structures  more  efficiently.  In  Chapter  7,  both  types  of  algorithms  are 
evaluated  over  a  space  of  architectures,  and  the  resulting  tradeoffs  are  discussed. 

Native  RDFT  Algorithm:  Constant  Geometry.  Similar  to  the  Pease  FFT  algorithm,  the  radix 
2 k  RDFT  can  be  computed  with  a  constant  geometry  algorithm: 


where  V  and  L  are  permutations,  and  functions  s  and  /  are  used  to  compute  the  parameter  to  each 
basic  block  rDFT2fc(-). 


This  algorithm  is  considered  to  have  constant  geometry  because  each  of  its  log/,  m  +  1  stages 
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performs  the  same  permutation  So,  this  algorithm  is  ideal  for  iterative  reuse:  the  only  change 

from  stage  to  stage  occurs  in  the  parameter  to  rDFT2fc().  This  parameter  controls  the  scaling 
performed  in  the  basic  block;  the  constants  and  a  portion  of  the  addressing  logic  are  pre-computed 
and  stored  in  lookup  tables;  the  remaining  portion  is  mapped  to  logic. 

This  algorithm  can  be  streamed  with  streaming  width  w  >  2k,  w  |  2 km.  Permutation  V  cannot 
be  represented  as  a  bit  matrix  and  is  thus  implemented  using  the  general  method  [19]. 


Native  RDFT  Algorithm:  Iterative.  A  similar  iterative  RDFT  algorithm  with  radix  2k  is  given 
by 


where  V,  W,  and  L  are  permutations.  V  is  implemented  with  the  general  method  described  in 
Chapter  5.3,  while  W  and  L  are  implemented  using  the  bit  matrix  method  in  [20]. 

This  algorithm  is  similar  to  the  native  RDFT  algorithm  with  constant  geometry  above.  However, 
the  permutation  (here,  W)  depends  on  stage  index  i:  smaller  values  of  i  result  in  smaller  permu¬ 
tations  and  lower  cost.  So,  if  iterative  reuse  is  not  used,  this  algorithm  will  have  lower  cost  than 
the  corresponding  instance  of  the  constant  geometry  algorithm,  which  requires  permutations  of  size 
2 km  in  each  stage.  This  tradeoff  is  analogous  with  the  tradeoff  between  the  Pease  and  Iterative 
Cooley-Tukey  algorithms  for  the  complex  DFT,  as  explained  above  in  Chapter  6.1. 

6.4  Discrete  Cosine  Transform 

This  thesis  considers  an  algorithm  for  the  discrete  cosine  transform  of  type  two: 


(s) 

where  S'2k  represents  one  computational  stage: 


S$  =  (/2*-i  M^s/))  •  diag (gk{£,  s))  •  (J2*_2  (g>,  h[s))  •  (J2*_ i  ®  F2) 
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and  U,  PH,  and  L  are  permutations.  PH  and  L  can  be  implemented  using  the  bit  matrix  method  if 
the  problem  size  is  a  power  of  two,  but  U  must  always  be  implemented  using  the  general  streaming 
method. 

Each  of  the  terms  in  this  algorithm  easily  maps  to  streaming  with  width  w  >  4.  If  the  H.\  matrix 
is  treated  as  a  streaming  permutation  instead  of  a  basic  block,  the  algorithm  may  be  streamed  with 
w  =  2,  although  this  gives  added  overhead. 

As  the  algorithm  goes  from  left  to  right,  the  permutation  (Iwk-3- 1  0  Liy 1 ' )  grows  more  local, 
and  thus  less  expensive  to  perform  on  streaming  data.  The  product  term  can  be  mapped  to  iterative 
reuse,  however  this  dependence  of  the  permutation  on  the  stage  number  s  means  that  added  overhead 
would  be  required  to  support  all  k  permutations  within  one  memory  block  (see  discussion  at  the  end 
of  Chapter  5.3.1).  Instead,  other  constant  geometry  algorithms  could  be  used  to  support  iterative 
reuse  with  less  overhead  (for  example,  [15]). 


Chapter  7 


Evaluation 


This  chapter  presents  a  set  of  experiments  evaluating  the  designs  generated  by  Spiral  based  on  the 
methods  described  in  Chapter  4.  The  results  shown  here  demonstrate  the  following  aspects  of  the 
Spiral-generated  hardware  implementations: 

1 .  Cost/performance  tradeoff.  Hardware  cores  generated  by  Spiral  span  a  tradeoff  space  be¬ 
tween  cost  (in  resources  required)  and  performance  (typically  throughput).  Users  can  then 
make  a  choice  along  the  cost/performance  tradeoff  space  to  pick  the  design  that  best  meets  a 
particular  application’s  needs. 

2.  Algorithmic  and  architectural  freedoms.  The  freedoms  allowed  by  this  system  enrich  the 
cost/performance  tradeoff  space.  That  is,  the  flexibility  at  both  the  algorithmic  level  and  data¬ 
path  mapping  level  contribute  to  the  set  Pareto  set  of  best  designs  produced  by  the  generation 
tool. 

3.  Quality  of  designs.  The  cost  and  performance  of  the  designs  automatically  generated  by 
Spiral  are  comparable  to  that  of  existing  platform-tuned  benchmarks,  when  available. 

4.  Target  flexibility.  The  generation  system  allows  flexibility  in  both  the  implementation  target 
(FPGA,  ASIC)  and  data  type  (fixed  point,  floating  point). 

5.  Problem  flexibility.  Spiral  is  able  to  generate  implementations  for  multiple  transforms,  in¬ 
cluding  the  DFT  (both  with  two-power  and  non-two-power  problem  size),  DCT,  RDFT,  and 
multi-dimensional  transforms. 
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The  remainder  of  this  chapter  is  organized  as  follows.  First,  Section  7.1  describes  the  evalu¬ 
ation  methodology  used  for  this  set  of  experiments,  including  FPGA  and  ASIC  synthesis.  Then, 
Section  7.2  gives  a  detailed  design  space  exploration  example  including  evaluation  of  the  effects 
of  different  design  parameters.  Next,  Section  7.3  shows  results  for  implementations  of  the  DFT 
on  FPGAs  including  fixed  and  floating  point  data  formats  and  both  two-power  and  non-two-power 
problem  sizes.  Then,  Section  7.4  shows  results  of  an  ASIC  evaluation  of  fixed  and  floating  point 
DFTs,  and  looks  at  power  and  area  versus  performance.  Further,  frequency  scaling  is  discussed  in 
Section  7.4.2  and  Section  7.4.3  discusses  joint  power/area  optimization  under  a  fixed  performance 
constraint.  Lastly,  Section  7.5  shows  example  results  from  other  transforms,  and  Section  7.6  sum¬ 
marizes  and  provides  some  concluding  remarks. 

7.1  Methodology 

The  compilation  flow  described  in  Chapter  4  has  been  implemented  in  the  following  way.  First, 
the  algorithm  generation,  formula  rewriting,  and  basic  block  matrix  compilation  stages  have  been 
built  inside  of  Spiral.  The  tools  to  factor  streaming  permutations  as  described  in  Chapter  5  arc  built 
partially  within  Spiral  and  partially  as  a  separate  C  program  based  on  the  MiniSat  SAT  solver  [31], 
which  is  used  to  perform  the  factorization  (5.3).  Lastly,  a  Verilog  backend  written  in  Java  takes 
as  input  a  hardware  formula  and  outputs  synthesizable  register-transfer  level  Verilog.  The  entire 
compilation  flow  is  initiated  and  run  through  Spiral. 

All  of  the  experiments  here  are  conducted  independent  of  the  data  bandwidth  that  may  be  avail¬ 
able  to  the  implemented  core.  This  is  done  because  the  implementations  produced  by  Spiral  can 
be  integrated  in  different  types  of  systems.  In  some,  all  data  may  be  produced  and  consumed  on 
chip,  potentially  allowing  a  large  amount  of  data  to  be  processed  by  the  core.  Other  systems  may 
require  input  and  output  data  to  be  transferred  on  and  off  chip  before  and  after  each  computation; 
the  throughput  of  these  systems  may  be  limited  by  available  bandwidth. 

The  bandwidth  required  for  each  design  presented  in  the  experiments  presented  below  can  easily 
be  calculated.  Performance  in  these  experiments  is  expressed  as  throughput  in  billions  of  samples 
per  second.  For  a  design  running  at  t  billion  samples  per  second  with  a  data  type  requiring  b 
bytes  per  sample,  the  total  amount  of  bandwidth  (in  gigabytes  per  second)  needed  is  given  by  2bt. 
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For  example,  some  of  the  experiments  below  consider  16  bit  fixed  point  data  types.  For  complex 
transforms  (such  as  DFT),  this  means  4  bytes  are  required  per  sample.  So,  a  throughput  of  1  billion 
samples  per  second  would  require  a  total  of  8  gigabytes  per  second  (4  for  input  and  4  for  output). 

7.1.1  FPGA 

This  section  describes  the  procedures  used  in  the  FPGA  (field-programmable  gate  array)  evaluations 
described  in  this  chapter.  All  FPGA  experiments  done  target  the  Xilinx  Virtex-6  XC6VLX760  or 
the  Xilinx  Virtex-5  XC5YLX330  FPGAs  (hereafter  referred  to  as  Virtex-6  and  Virtex-5).  All  FPGA 
synthesis  is  performed  using  Xilinx  ISE  version  12.0,  and  the  area  and  timing  data  shown  in  the 
results  are  extracted  after  the  final  place  and  route  are  complete. 

In  order  to  determine  the  maximum  clock  frequency  obtainable  for  each  design,  it  is  often 
necessary  to  iterate  over  multiple  design  options  and  target  frequencies.  This  is  done  using  the 
Xilinx-provided  Xplorer  script,  configured  to  use  a  maximum  of  four  place  and  route  runs  to  find 
the  best  frequency. 

Arithmetic  units.  The  experiments  detailed  here  use  two  data  types.  For  fixed  point  data  types, 
the  numeric  computations  are  simply  specified  using  the  +  and  *  operators.  Single  precision  floating 
point  is  implemented  using  IP  cores  generated  using  the  Xilinx  LogiCore  IP  tool. 

The  Xilinx  FPGAs  contain  dedicated  arithmetic  units  called  DSP  slices  (DSP48E  on  the  Virtex- 
5  and  DSP48E1  on  the  Virtex-6).  DSP  slices  contain  hard  multipliers,  accumulators,  registers,  and 
interconnect.  The  multipliers  in  these  slices  are  used  in  fixed  point  and  floating  point  multiplication. 
Additionally,  the  multipliers  are  used  as  shifters  for  floating  point  addition.  The  single  precision 
floating  point  adder  and  multiplier  each  use  two  DSP  slices,  while  the  16  bit  fixed  point  multiplier 
uses  one  DSP  slice. 

Memory.  Memories  can  be  mapped  to  two  types  of  structure  on  Xilinx  FPGAs:  block  RAM 
(BRAM)  or  distributed  RAM.  BRAMs  are  36kb  dedicated  hard  memories  built  into  the  FPGA.  The 
Virtex-6  contains  720  BRAMs;  the  Virtex-5  contains  288.  Additionally,  memory  structures  can  be 
constructed  out  of  the  FPGA’s  lookup  tables  (LUTs)  which  are  normally  used  as  logic. 

The  Spiral  hardware  compiler  can  choose  between  distributed  and  block  RAM  by  outputting 
directives  for  the  Xilinx  ISE  tool.  The  user  can  control  how  Spiral  decides  whether  a  memory 
should  be  implemented  as  block  or  distributed  RAM  using  two  tunable  methods.  First,  a  budget- 
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based  method  allows  the  user  to  specify  the  maximum  number  of  BRAM  to  use.  Then,  the  tool 
globally  sorts  all  memory  structures  needed  in  the  design  by  size;  it  chooses  which  structures  to 
map  into  the  budgeted  number  of  BRAMs  in  order  to  maximize  utilization. 

Secondly,  a  threshold  option  allows  the  user  to  specify  a  threshold  memory  size  (in  bytes).  Any 
memory  structure  of  that  size  or  larger  is  then  mapped  to  BRAM.  For  example,  the  user  could  say 
only  to  utilize  a  BRAM  if  the  memory  structure  implemented  would  use  8  kilobits  of  it. 

The  experiments  presented  in  this  chapter  use  the  budget  method  with  a  budget  of  256  BRAMs. 
This  represents  approximately  35%  of  the  BRAM  on  the  Virtex-6  FPGA.  For  the  designs  considered 
here,  this  budget  exceeds  the  memory  requirement  of  the  system;  distributed  RAM  is  not  used  for 
large  memories. 

7.1.2  ASIC 

This  chapter  also  contains  a  set  of  experiments  evaluating  the  Spiral-generated  cores  when  syn¬ 
thesized  for  an  application-specific  integrated  circuit  (ASIC).  ASIC  synthesis  is  performed  using 
Synopsys  Design  Compiler  version  C-2009.06-SP5,  and  targets  a  commercial  65nm  standard  cell 
library.  Area,  timing  and  power  estimates  are  obtained  from  Design  Compiler  post-synthesis. 

Arithmetic  units.  Synopsys  Design  Ware  is  used  to  construct  hardware  for  addition,  subtraction, 
and  multiplication  (for  both  fixed  point  and  floating  point)  in  the  following  way.  For  each  arithmetic 
unit,  a  combinational  adder/subtractor/multiplier  is  instantiated,  and  pipelining  is  then  performed  by 
Design  Compiler:  a  number  of  registers  are  placed  before/after  the  arithmetic  unit,  and  tool  is  run 
in  retiming  mode.  Then,  synthesized  results  are  saved  in  binary  format  and  are  instantiated  in 
the  Spiral-generated  designs.  Different  levels  of  pipelining  are  needed  for  different  data  types  and 
different  frequencies.  The  number  of  stages  needed  can  be  determined  for  a  given  frequency  by 
finding  the  fewest  number  of  registers  needed  to  meet  timing.  Table  7.1  summarizes  these  results 
for  16  and  32  bit  fixed  point,  and  single  precision  floating  point. 

Memory.  No  memory  compiler  was  accessible  for  the  particular  standard  cell  library  used  in 
these  evaluations.  So,  CACTI  version  6.5  [32],  a  memory  estimation  tool,  is  used  to  estimate  power 
and  area  for  all  required  memories. 


77 


Chapter  7.  Evaluation 
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Table  7.1:  Latency  (cycles)  of  arithmetic  operators  at  given  frequencies. 


7.2  Design  Space  Exploration  Example 


In  order  to  examine  how  different  design  space  freedoms  (algorithm,  radix,  iterative  reuse,  stream¬ 
ing  reuse)  contribute  to  the  space  of  designs,  this  section  provides  a  detailed  evaluation  of  Spiral¬ 
generated  implementations  of  DFT256  and  DFT1024  on  a  Xilinx  Virtex-6  FPGA.  Figure  7.1  shows 
throughput  (in  billion  samples  per  second)  versus  area  (given  in  slices)  on  the  Xilinx  Virtex-6 
FPGA.  Given  a  fixed  problem  size  n,  performance  (in  pseudo-gigaoperations  per  second,  assuming 
5 n  log2  n  operations  per  DFTn)  is  proportional  to  throughput;  the  right-hand  y-axis  shows  perfor¬ 
mance  for  each  design. 

Different  data  markers  are  used  to  illustrate  the  different  parameters:  an  algorithm’s  radix  r  and 
the  implementation’s  depth  d  (for  iterative  reuse).  The  diamond-shaped  markers  indicate  designs 
with  full  IR  (iterative  reuse),  and  the  triangular  markers  indicate  partial  IR.  Lastly,  designs  marked 
with  circles  and  squares  contain  no  IR.  For  each  series,  the  streaming  width  w  starts  from  w  =  r 
and  goes  up  to  32. 

As  discussed  above  in  Chapter  6.1,  Spiral  generates  hardware  cores  for  the  DFT  using  different 
algorithms,  depending  on  the  situation.  In  Figure  7.1,  the  algorithm  can  be  determined  based  on  the 
parameters  displayed.  First,  the  Pease  FFT  algorithm  (2.13)  is  used  when  full  IR  is  used  (that  is, 
d  =  1).  So,  for  DFT256,  the  black,  gray,  and  white  diamonds  represent  the  Pease  FFT  with  radix 
r  =  2,4,  and  16,  respectively.  For  DFT1024  the  same  data  markers  represent  the  same  algorithm 
with  radix  r  =  2,4,  and  8. 

When  IR  is  not  used  (represented  as  circles  and  squares),  two  algorithms  are  used,  depending 
on  the  radix  r:  the  Iterative  FFT  (2.15)  and  the  Mixed-Radix  FFT  (2.16).  If  the  problem  size  n  can 
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DFT  256  (16  bit  fixed  point)  on  Xilinx  Virtex-6  FPGA 

throughput  [billion  samples  per  second] 


12 


performance  [Gop/s] 
480 


DFT  1024  (16  bit  fixed  point)  on  Xilinx  Virtex-6  FPGA 

throughput  [billion  samples  per  second] 

12 


performance  [Gop/s] 
600 


Figure  7.1:  Exploring  DFT256  and  DFT1024,  fixed  point,  FPGA,  throughput  versus  slices.  Data 
labels  indicate  the  number  of  block  RAMs  and  DSP48E1  slices  required.  Data  markers  are  used  to 
illustrate  values  of  radix  r  and  depth  d. 
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be  represented  as  n  =  rk  for  integer  k,  then  the  Iterative  FFT  algorithm  is  used  directly.  If  that  is 
not  the  case,  then  the  Mixed-Radix  algorithm  decomposes  the  problem  into  two  stages:  one  that  can 
be  expressed  as  a  power  of  radix  r,  and  one  smaller  stage:  n  =  rk  ■  n/  (rk).  For  example,  DFT1024 
can  be  computed  with  and  three  radix  8  stages  and  one  radix  2  stage:  1024  =  83  •  2. 

This  evaluation  omits  designs  that  use  the  Mixed-Radix  algorithm  with  IR  (iterative  reuse)  de¬ 
signs  (for  example,  implementing  DFT1024  with  an  iteratively-reused  implementation  of  radix  8 
DFT512  and  one  stage  of  DFT2),  although  designs  of  this  kind  are  easily  generated  using  the  pro¬ 
posed  system. 

The  black  line  indicates  the  Pareto  optimal  set  of  designs  for  each  graph.  These  designs  are 
those  that  give  the  best  tradeoff  between  the  two  metrics  being  considered  (here,  throughput  and 
area).  So,  if  a  designer  only  cared  about  those  two  measurements,  the  points  along  the  line  are 
the  only  ones  that  he  or  she  must  consider.  However,  other  costs  are  associated  with  FPGA  imple¬ 
mentations:  required  memory  (number  of  block  RAMs  or  BRAMs),  and  dedicated  arithmetic  units 
called  DSP48E1  slices.  Several  points  in  each  graph  are  annotated  with  the  number  of  BRAMs  (first 
number)  and  DSP  slices  (second  number)  required. 

The  set  of  designs  considered  cover  a  wide  tradeoff  space.  For  DFT256,  the  fastest  design  is 
76  x  faster  than  the  slowest,  but  it  requires  33  x  the  number  of  slices,  16  x  the  number  of  BRAMs, 
and  59x  the  number  of  DSP  slices.  Similarly,  the  fastest  DFT1024  implementation  is  132x  faster 
than  the  slowest,  but  it  uses  49x  the  slices,  96x  the  BRAM,  and  107x  the  DSP  slices. 

In  each  graph,  the  smallest  and  slowest  design  is  an  IR  core  with  depth  d  =  1  generated  from  the 
Pease  FFT  with  radix  r  =  2  and  streaming  width  w  =  2.  From  there,  the  black  diamonds  illustrate 
the  same  algorithm,  radix,  and  depth,  with  increasing  streaming  width:  w  =  2,4,  8, 16,  32.  As  the 
streaming  width  increases  (following  the  line  of  black  diamonds),  the  designs  increase  in  throughput 
and  in  cost.  Quickly,  the  radix  2  IR  designs  become  Pareto-suboptimal.  The  next  Pareto-optimal 
point  is  r  =  2,  w  =  2,  and  d  =  2;  now  the  IR  product  term  has  twice  the  depth.  Of  the  designs  with 
partial  IR,  only  this  point  contributes  to  the  Pareto  front. 

Next,  the  gray  diamonds  show  the  d  =  1  designs  from  the  Pease  radix  4  algorithm,  with  width 
w  =  4, 8, 16,  32.  In  both  plots,  the  IR  radix  4  points  contribute  just  one  design  to  the  Pareto 
front.  Following  that,  the  circles  and  squares  represent  fully  streaming  designs  (those  without  IR), 
which  provide  the  rest  of  the  Pareto  front  (while  the  IR  designs  previously  discussed  provide  less 
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extra  throughput  given  the  extra  area  cost).  So,  the  IR  designs  provide  the  small/slow  end  of  the 
Pareto  optimal  front,  but  as  they  scale  larger,  their  performance  per  area  is  quickly  eclipsed  by  the 
streaming  designs. 

Further  interesting  comparisons  can  be  made  by  examining  the  behavior  of  different  radices.  For 
example,  in  the  DFT256  experiment,  there  are  three  points  near  the  Pareto  front  at  approximately  3.5 
billion  samples  per  second  and  approximately  2,000  slices.  Each  of  these  designs  is  fully  unrolled 
and  streamed  with  width  w  =  8;  their  only  difference  is  in  radix  r  (2,  4,  or  8).  All  three  points  are 
similar  in  throughput  and  area,  however  the  data  labels  indicate  that  they  require  differing  amounts 
of  BRAM  and  DSP  slices.  The  least  expensive  of  the  three  is  the  radix  4  design,  which  requires 
only  16  BRAM  and  60  DSP  slices. 

It  may  seem  counterintuitive  that  the  radix  4  design  can  provide  better  results  here  than  the 
radix  8,  since  in  general  higher  radices  lead  to  lower  arithmetic  and  permutation  costs.  However, 
the  difference  here  lies  in  algorithm:  at  radix  8,  the  overhead  of  using  the  Mixed  Radix  algorithm 
to  decompose  DFT256  into  DFT4  and  DFT82  offsets  the  savings  of  using  a  radix  8  algorithm  for 
a  portion  of  the  computation.  In  the  DFT1024  plot,  the  equivalent  radix  4  and  8  designs  are  much 
closer:  the  radix  8  design  uses  slightly  more  area  but  fewer  BRAMs  and  DSP  slices.  The  Mixed- 
Radix  algorithm  is  most  useful  when  the  problem  size  is  such  that  r  =  2  is  the  only  natural  radix 
supported  (for  example,  Mixed-Radix  FFT  is  necessary  to  compute  DFT128  with  any  radix  other 
than  2). 

7.3  Discrete  Fourier  Transform  on  FPGA 

This  section  shows  results  of  a  detailed  study  of  implementations  of  the  discrete  Fourier  transform 
on  FPGA.  Section  7.3.1  discusses  fixed  point  implementations,  Section  7.3.2  discusses  floating 
point  implementations,  and  Section  7.3.3  shows  implementations  when  the  problem  size  is  not  a 
power  of  two. 

7.3.1  Fixed  Point 

Figures  7.2  and  7.3  show  throughput  (in  billion  samples  per  second)  versus  area  (in  slices)  of  the 
discrete  Fourier  transform  of  size  64,  256,  1024,  and  4096  with  16  bit  fixed  point  data  format  on 
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the  Xilinx  Virtex-6  FPGA.  The  secondary  axis  presents  performance  in  pseudo  gigaoperations  per 
second  calculated  for  all  algorithms  assuming  5 n  log2  n  operations  per  n  point  transform.  Here, 
gray  diamonds  represent  Spiral-generated  designs  across  the  degrees  of  freedom  discussed  in  Chap¬ 
ter  7.2.  The  solid  line  connects  the  Spiral-generated  Pareto  optimal  points.  Lastly,  the  black  circles 
represent  the  four  designs  from  the  Xilinx  LogiCore  FFT  IP  library  [33].  These  designs  were  imple¬ 
mented  using  LogiCore  and  evaluated  using  the  same  Xilinx  tools  as  the  Spiral-generated  designs. 

As  discussed  in  Chapter  7.2,  a  number  of  design  freedoms  are  used  along  the  Pareto  front, 
including  designs  with  and  without  iterative  reuse,  and  designs  with  streaming  width  w  up  to  32. 

In  all  experiments  shown  here,  the  Xilinx  IP  cores  closely  match  the  cost/performance  of  the 
smallest  Spiral-generated  designs.  However,  as  more  slices  are  allowed  to  be  consumed,  the  Spiral¬ 
generated  designs  are  able  to  obtain  a  commensurate  increase  in  throughput.  For  example,  for  the 
DFT1024  study,  the  largest/fastest  design  requires  49x  the  slices,  96x  the  BRAM,  and  107x  the 
DSP48E1  slices  of  the  smallest/slowest  design  point,  but  has  132x  the  throughput.  Compared  with 
the  largest/fastest  design  in  the  Xilinx  LogiCore  library,  the  fastest  Spiral-generated  design  uses 
1 1.8x  the  slices,  19. 2x  the  BRAM,  and  18. 8x  the  DSP  slices  while  providing  16.5x  throughput. 

Similar  trends  are  seen  across  the  experiments  as  the  problem  size  increases.  However,  the 
absolute  area  requirements  increase  with  the  problem  size.  This  is  primarily  caused  by  the  added 
number  of  stages  in  designs  without  iterative  reuse,  and  in  the  added  cost  associated  with  larger 
permutations  and  diagonal  matrices.  Also,  it  is  important  to  note  that  although  the  throughput  stays 
roughly  the  same  as  the  problem  size  increases,  the  performance  (operations  /  second)  increases. 

If  latency  were  used  as  the  performance  metric  instead  of  throughput,  similar  trends  would  also 
be  seen,  except  the  designs  without  iterative  reuse  (the  higher  performance/cost  designs)  would 
exhibit  a  performance  penalty  because  they  are  optimized  for  throughput:  different  portions  of  the 
datapath  process  different  data  vectors  at  once.  The  iterative  reuse  designs  are  in  a  sense  optimized 
for  latency-based  computation  because  they  only  employ  a  small  amount  of  overlapping  of  multiple 
problems. 

7.3.2  Floating  Point 

Next,  Figures  7.4  and  7.5  repeat  the  same  set  of  DFT  experiments  on  FPGA  using  single  precision 
floating  point  (32  bit).  Again,  the  x-axis  shows  FPGA  area  in  terms  of  slices,  and  y-axis  shows 
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DFT  64  (16  bit  fixed  point)  on  Xilinx  Virtex-6  FPGA 

throughput  [billion  samples  per  second]  performance  [Gop/s] 

12  360 


DFT  256  (16  bit  fixed  point)  on  Xilinx  Virtex-6  FPGA 

throughput  [billion  samples  per  second]  performance  [Gop/s] 

12  480 


Figure  7.2:  DFT64  and  DFT256,  fixed  point,  FPGA,  throughput  versus  slices  compared  with  Xilinx 
LogiCore. 
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DFT  1024  (16  bit  fixed  point)  on  Xilinx  Virtex-6  FPGA 

throughput  [billion  samples  per  second]  performance  [Gop/s] 

12  600 


DFT  4096  (16  bit  fixed  point)  on  Xilinx  Virtex-6  FPGA 

throughput  [billion  samples  per  second]  performance  [Gop/s] 

12  720 


Figure  7.3:  DFT1024  and  DFT4096,  fixed  point,  FPGA,  throughput  versus  slices  compared  with 
Xilinx  LogiCore. 
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throughput  in  billions  of  samples  per  second  (left)  and  performance,  now  in  billions  of  floating 
point  operations  per  second  (Gigaflop/s)  on  the  right. 

The  Xilinx  LogiCore  FFT  [33]  does  not  provide  a  full  floating  point  implementation.  Rather, 
it  provides  an  implementation  where  the  input  and  output  are  given  in  single-precision  floating 
point,  but  internally,  computations  are  performed  using  a  fixed  point  implementation  with  twiddle 
factors  stored  as  24  or  25  bit  fixed  point  values.  This  results  in  arithmetic  units  that  are  much  less 
expensive  to  implement  than  the  full  floating  point  implementations  generated  by  Spiral.  The  four 
architectures  provided  by  Xilinx  LogiCore  (with  25  bit  twiddle  factors)  are  shown  as  black  circles 
in  Figures  7.4  and  7.5.  Although  the  Xilinx  designs  have  lower-cost  fixed  point  arithmetic  units, 
they  compare  closely  with  the  Spiral-generated  designs.  The  biggest  difference  can  be  seen  on  the 
largest/fastest  of  the  four  Xilinx  designs. 

Additionally,  for  all  but  DFT64,  a  data  point  corresponding  to  the  4DSP  floating  point  FFT 
processor  [34]  is  shown  (black  triangle).  This  is  a  commercially-available  IP  core  that  is  capable 
of  performing  the  DFT  on  a  range  of  input  sizes.  That  is,  it  is  a  floating  point  DFT  processor  with 
runtime-configurable  length.  Area  and  performance  numbers  were  estimated  based  on  data  given 
in  its  data  sheet  [34].  As  shown,  the  4DSP  core  has  a  considerable  area  overhead  relative  to  the 
Spiral-generated  or  Xilinx  designs.  Flowever,  its  runtime  configurability  leads  to  added  overhead, 
so  a  direct  comparison  cannot  be  made. 

Comparing  the  cost  and  performance  of  the  Spiral-generated  floating  point  designs  (Figures  7.4 
and  7.5)  to  the  fixed  point  implementations  (Figures  7.2  and  7.3),  the  main  difference  lies  in  the  base 
cost  of  computation.  This  leads  to  a  much  higher  cost  to  achieve  the  same  throughput  in  a  floating 
point  implementation  as  a  fixed  point  implementation.  This  cost  is  also  affected  by  added  overhead 
in  permutations,  diagonal  matrices  (ROMs  to  store  constants)  and  registers  within  the  system,  which 
now  must  store  32  bits  instead  of  16. 

Further,  the  higher  base  cost  reduces  the  space  of  feasible  architectural  options.  In  the  fixed 
point  experiments,  a  streaming  width  up  to  32  complex  words  per  cycle  was  implemented.  Here, 
the  maximum  streaming  width  achieved  (when  iterative  reuse  is  not  used)  is  16  words  per  cycle. 
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DFT  64  (floating  point)  on  Xilinx  Virtex-6  FPGA 

throughput  [billion  samples  per  second] 


performance  [Gfiop/s] 
150 


DFT  256  (floating  point)  on  Xilinx  Virtex-6  FPGA 

throughput  [billion  samples  per  second] 


performance  [Gflop/s] 
200 


Figure  7.4:  DFT64  and  DFT256,  floating  point,  FPGA,  throughput  versus  slices  compared  with 
Xilinx  LogiCore  and  4DSP  FFT  cores. 
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DFT  1024  (floating  point)  on  Xilinx  Virtex-6  FPGA 

throughput  [billion  samples  per  second] 


performance  [Gflop/s] 
250 


DFT  4096  (floating  point)  on  Xilinx  Virtex-6  FPGA 

throughput  [billion  samples  per  second] 


performance  [Gflop/s] 
300 


240 


Figure  7.5:  DFT1024  and  DFT4096,  floating  point,  FPGA,  throughput  versus  slices  compared  with 
Xilinx  LogiCore  and  4DSP  FFT  cores. 
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7.3.3  DFT  with  Non-Power-of-Two  Problem  Size 

When  the  problem  size  n  of  DFTn  is  not  a  power  of  two,  a  variety  of  algorithms  (discussed  in 
Chapter  6.1)  can  be  utilized.  Here,  four  values  of  n  are  considered  that  illustrate  four  different 
classes  of  problem  size.  Figures  7.6  and  7.7  examine  the  throughput  versus  area  for  each  design. 
Data  labels  are  used  to  indicate  the  number  of  Virtex-5  DSP  slices  consumed  by  each  point  in  the 
Pareto-optimal  set.  This  evaluation  was  published  in  part  in  [35]. 

Large  prime  size.  First,  Figure  7.6  (top)  shows  throughput  versus  area  for  implementations  of 
DFT499.  Because  499  is  a  large  prime  number,  the  only  applicable  option  among  those  considered 
is  the  Bluestein  FFT  algorithm.  Each  black  diamond  represents  one  Bluestein-based  datapath,  and 
the  black  line  illustrates  the  designs  in  the  Pareto  optimal  set.  The  slowest  design  requires  approx¬ 
imately  1,500  slices  and  has  a  throughput  of  1 1  million  samples  per  second.  The  fastest  is  about 
18  x  larger  but  60  x  faster. 

Composite  number  with  larger  prime  factors.  Next,  Figure  7.6  (bottom)  shows  results  for 
DFT405.  In  addition  to  the  Bluestein  algorithm,  the  mixed  radix  FFT  algorithm  can  be  applied  to 
this  problem;  it  decomposes  DFT405  into  DFT34  and  DFT5.  This  enables  two  additional  designs 
(with  streaming  width  3  •  5  =  15),  shown  as  gray  circles.  Although  the  larger  of  these  designs 
provides  a  much  higher  throughput  than  a  similarly  sized  Bluestein-derived  design,  it  is  important 
to  note  that  the  cross-over  point  (where  the  mixed-radix  designs  become  the  better  choice)  is  large 
(at  approximately  15,000  slices).  So,  if  design  requirements  require  a  smaller,  lower  throughput 
core,  the  Bluestein  algorithm  is  a  better  choice. 

Composite  number  with  smaller  prime  factors.  Third,  Figure  7.7  (top)  illustrates  throughput 
and  area  for  DFT432.  Similar  to  the  previous  example,  DFT432  can  be  implemented  using  the 
mixed  radix  algorithm.  However,  unlike  DFT405,  here  DFT432  can  be  decomposed  into  smaller 
radices:  2  and  3  (because  432  =  24  •  33).  Since  the  radices  are  smaller,  the  mixed  radix  algorithm 
can  be  used  with  more  options  for  streaming  width,  yielding  a  larger  set  of  Pareto-optimal  designs 
than  for  DFT405.  The  Bluestein  designs  are  still  the  best  choice  for  the  smallest/slowest  designs, 
but  the  cross-over  point  (about  9,000  slices)  is  lower  than  in  the  previous  example. 

Power  of  small  prime.  Lastly,  Figure  7.7  (bottom)  shows  results  for  DFT243  =  DFT35.  Be¬ 
cause  the  problem  size  is  a  power  of  three,  the  radix  3  Pease  and  Iterative  FFTs  are  applied  directly 
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DFT  499  (16  bit  fixed  point)  on  Xilinx  Virtex-5  FPGA 

throughput  [billion  samples  per  second] 

0.7 


0  5,000  10,000  15,000  20,000  25,000 
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DFT  405  (16  bit  fixed  point)  on  Xilinx  Virtex-5  FPGA 

throughput  [billion  samples  per  second] 
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Figure  7.6:  DFT499  and  DFT405,  fixed  point,  FPGA,  throughput  versus  slices. 
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DFT  432  (16  bit  fixed  point)  on  Xilinx  Virtex-5  FPGA 

throughput  [billion  samples  per  second] 

3.5 
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DFT  243  (16  bit  fixed  point)  on  Xilinx  Virtex-5  FPGA 

throughput  [billion  samples  per  second] 
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Figure  7.7:  DFT432  and  DFT243,  fixed  point,  FPGA,  throughput  versus  slices,  compared  with 
Xilinx  LogiCore  DFT. 
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(shown  as  gray  circles).  Here,  the  cores  built  with  the  radix  3  algorithms  surpass  the  performance 
of  the  Bluestein  cores  at  a  much  lower  area  than  in  the  other  problems,  since  the  single-radix  design 
can  be  built  with  less  logic  than  in  the  previous  multi-radix  designs. 

Comparison.  The  Xilinx  LogiCore  library  contains  a  non-two-power  sized  DFT  core,  which 
is  a  processor  that  performs  a  number  of  non-two-power-sized  DFT  computations,  and  is  run-time 
configurable  in  transform  length.  Of  the  example  sizes  considered  in  this  evaluation,  the  LogiCore 
DFT  can  only  perform  DFT432.  Its  performance  and  area  are  given  as  a  white  square  in  Figure  7.7 
(top). 

7.4  Discrete  Fourier  Transform  on  ASIC 

As  described  above  in  Chapter  7.1.2,  Spiral-generated  hardware  cores  can  be  synthesized  for  a 
commercial  65nm  standard  cell  library.  First,  Section  7.4.1  presents  a  baseline  evaluation  of  designs 
with  maximum  clock  frequency.  Then,  Section  7.4.2  examines  frequency  scaling  and  its  effect  on 
performance,  power,  and  area.  Lastly,  Section  7.4.3  uses  frequency  scaling  to  implement  a  set  of 
designs  that  meet  a  given  throughput  target  while  allowing  a  tradeoff  between  power  and  area. 

7.4.1  Baseline:  Maximum  Frequency 

First,  this  section  presents  an  evaluation  of  the  DFT  on  ASIC.  Here,  designs  are  synthesized  with 
the  goal  of  maximizing  the  clock  frequency.  Figures  7.8-7. 1 1  show  throughput  versus  power  and 
area  for  DFTg4,  DFT256,  DFT1024  and  DFT4096  using  16  bit  fixed  point  data.  Figures  7.12-7.15 
repeat  the  same  experiments  utilizing  single  precision  floating  point. 

In  these  plots,  the  y-axis  represents  throughput  in  billions  of  samples  per  second  (left)  and 
billions  of  operations  (or  floating  point  operations)  per  second.  Additionally,  two  data  markers  are 
used:  black  diamonds  for  cores  with  iterative  reuse,  and  gray  circles  for  fully  streaming  designs, 
(that  is,  those  without  iterative  reuse).  A  gray  line  is  used  to  highlight  the  Pareto  optimal  set. 

Similar  to  the  FPGA  results,  here  the  iterative  reuse  designs  provide  the  least  expensive  (smallest 
and  lowest  power)  but  slowest  designs.  For  example,  for  DFT1024  with  fixed  point  data,  the  fastest 
design  is  approximately  lOOx  faster  than  the  slowest,  but  requires  1 1  x  the  area  and  24  x  the  power. 
Also  similar  to  the  FPGA  experiments,  designs  with  iterative  reuse  comprise  a  larger  portion  of  the 
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DFT  64  (16  bit  fixed  point)  on  ASIC  (65nm) 

throughput  [billion  samples  per  second] 
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performance  [Gop/s] 
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DFT  64  (16  bit  fixed  point)  on  ASIC  (65nm) 
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Figure  7.8:  DFT64  throughput  versus  power  (top)  and  area  (bottom),  fixed  point  on  65nm  ASIC. 
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DFT  256  (16  bit  fixed  point)  on  ASIC  (65nm) 
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DFT  256  (16  bit  fixed  point)  on  ASIC  (65nm) 
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Figure  7.9:  DFT256  throughput  versus  power  (top)  and  area  (bottom),  fixed  point  on  65nm  ASIC. 
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DFT  1024  (16  bit  fixed  point)  on  ASIC  (65nm) 
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DFT  1024  (16  bit  fixed  point)  on  ASIC  (65nm) 
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Figure  7.10:  DFT1024  throughput  versus  power  (top)  and  area  (bottom),  fixed  point  on  65nm  ASIC. 
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DFT  4096  (16  bit  fixed  point)  on  ASIC  (65nm) 
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DFT  4096  (16  bit  fixed  point)  on  ASIC  (65nm) 
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Figure  7.11:  DFT4096  throughput  versus  power  (top)  and  area  (bottom),  fixed  point  on  65nm  ASIC. 
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DFT  64  (floating  point)  on  ASIC  (65nm) 
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DFT  64  (floating  point)  on  ASIC  (65nm) 
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Figure  7. 12:  DFTfjj  throughput  versus  power  (top)  and  area  (bottom),  floating  point  on  65nm  ASIC. 
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DFT  256  (floating  point)  on  ASIC  (65nm) 
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25 


1,000 


800 


600 


400 


200 


0 


power  [W] 


DFT  256  (floating  point)  on  ASIC  (65nm) 
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Figure  7.13:  DFT256  throughput  versus  power  (top)  and  area  (bottom),  floating  point  on  65nm 
ASIC. 
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Figure  7.14:  DFT1024  throughput  versus  power  (top)  and  area  (bottom),  floating  point  on  65nm 
ASIC. 
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Figure  7.15:  DFT4096  throughput  versus  power  (top)  and  area  (bottom),  floating  point  on  65nm 
ASIC. 
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Pareto  optimal  set  of  designs  when  the  problem  size  is  large. 

Floating  point  designs  are  considerably  more  expensive  than  their  16  bit  fixed  point  counterparts. 
Further,  the  increased  area  and  complexity  lead  to  lower  synthesized  frequencies.  For  example, 
the  fastest  point  considered  in  the  fixed  point  DFT1024  experiment  can  be  synthesized  to  run  at 
858  MFIz,  producing  a  throughput  of  27.5  billion  samples  per  second.  The  corresponding  floating 
point  design  only  reaches  730  MHz,  giving  a  slower  throughput  of  23.4  billion  samples  per  second. 
Further,  the  area  and  power  requirements  increase  dramatically:  from  3.7  mm2  and  2.4W  to  18.3 
mm2  and  12.2  W. 

For  both  the  fixed  point  and  the  floating  point  experiments,  designs  up  to  streaming  width  w  = 
32  are  considered.  Wider  designs  can  easily  be  generated,  but  become  increasingly  more  difficult 
for  Design  Compiler  to  synthesize,  increasing  the  synthesis  time  and  memory  requirement,  so  they 
are  not  considered  in  these  experiments. 

7.4.2  Frequency  Scaling 

The  previous  results  showed  throughput  versus  power  and  area  when  all  designs  were  synthesized 
targeting  the  maximum  frequency.  However,  when  power  efficiency  is  a  concern,  it  is  also  necessary 
to  consider  designs  running  at  lower  clock  frequencies.  This  section  demonstrates  the  trends  in 
performance,  power,  and  area  when  designs  are  re-synthesized  at  lower  frequencies.  In  addition  to 
savings  that  may  be  provided  by  the  synthesis  tool,  area  and  power  are  further  saved  by  reducing 
the  amount  of  pipelining  necessary  in  the  arithmetic  units  (as  previously  shown  in  Table  7.1). 

Figure  7.16  revisits  the  16  bit  fixed  point  DFT1024  experiment  detailed  above.  Now,  the  Pareto 
suboptimal  points  are  removed,  leaving  only  the  Pareto  front,  shown  as  black  diamonds.  Then  gray 
diamonds  are  used  to  show  how  throughput,  power,  and  area  of  five  of  the  Pareto  optimal  designs 
are  effected  as  the  designs  are  regenerated  and  re-synthesized  targeting  200,  400,  and  600  MHz. 

As  the  frequency  is  decreased,  throughput  and  power  are  reduced  commensurately.  As  seen  in 
the  bottom  graph  of  Figure  7.16,  area  is  reduced  only  slightly  as  frequency  decreases. 

This  shows  that  it  can  be  more  power-efficient  to  generate  a  larger,  more  parallel  design  and  run 
it  at  a  lower  frequency  than  to  build  a  smaller  design  running  at  a  higher  clock  frequency.  These 
larger  designs  have  more  algorithmic  freedom,  so  they  can  be  built  with  more  efficient  algorithms, 
using  higher  radices. 
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Figure  7.16:  DFT1024,  fixed  point  on  ASIC:  throughput  versus  power  (top)  and  area  (bottom)  with 
frequency  scaling. 
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As  an  example,  Figure  7.16  shows  a  point  from  the  original  Pareto  optimal  set  at  approximately 
7  billion  samples  per  second,  with  area  of  0.9  mm2  consuming  approximately  0.5  Watts  (while 
clocked  at  1.4  GHz).  However,  similar  throughput  can  be  obtained  using  a  larger  design  clocked 
at  200  MHz,  which  requires  only  0.3  W.  However,  this  leads  to  an  in  increase  in  area;  the  larger 
design  requires  2.2  mm2.  This  type  of  tradeoff  between  area  and  power  given  a  fixed  performance 
constraint  is  explored  in  the  following  section. 

Although  the  graphs  in  this  section  only  demonstrate  these  effects  on  five  points  from  one  de¬ 
sign,  more  evaluations  (not  shown  here)  confirm  these  trends. 

7.4.3  Power/Area  Optimization  Under  Throughput  Requirement 

Often,  a  design  requires  that  a  transform  be  implemented  to  reach  a  given  throughput  requirement. 
The  frequency  scaling  experiments  discussed  above  can  be  modified  to  reflect  the  type  of  exploration 
required  for  such  a  situation. 

Figure  7.17  again  studies  fixed  point  implementations  of  DFT1024  on  65nm  ASIC  in  the  pres¬ 
ence  of  frequency  scaling.  However,  now  the  designs  are  not  synthesized  at  arbitrary  frequencies. 
Instead,  each  is  synthesized  at  precisely  the  frequency  needed  to  yield  a  performance  of  a  fixed 
throughput  target:  two  billion  samples  per  second,  illustrated  with  the  dashed  line.  As  shown  in  the 
top  plot,  reducing  the  frequency  of  the  more  parallel  designs  yields  much  lower  power  at  the  target 
throughput.  However,  as  shown  in  the  bottom  plot  of  Figure  7.17,  this  requires  a  much  larger  area 
than  the  less  parallel  designs. 

This  process  results  in  nine  different  designs,  each  with  equal  throughput.  Some  require  less 
power  but  more  area;  others  the  opposite.  Figure  7.18  collects  the  set  of  possible  tradeoffs  and 
displays  them  graphically.  The  y-axis  shows  power  of  each  design,  while  the  x-axis  shows  the  area. 
The  black  line  highlights  the  five  designs  that  are  Pareto  optimal  with  respect  to  power  and  area. 
Since  performance  is  equivalent  for  each  design,  the  designer  can  then  choose  whichever  point  best 
matches  the  required  power  and  area. 

Although  the  results  here  and  in  Section  7.4.2  examine  frequency  scaling  only  for  ASICs,  this 
technique  should  extend  to  FPGA  as  well. 
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Figure  7.17:  DFT  1024,  fixed  point,  throughput  versus  power/area  on  ASIC,  frequency  scaled  to 
reach  2  billion  samples  per  second. 
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DFT  1024  (16  bit  fixed  point)  on  ASIC  (65nm)  at  2  billion  samples  per  second 
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Figure  7.18:  DFT1024,  fixed  point,  power  versus  area,  fixed  throughput  of  2  billion  samples  per 
second. 
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7.5  Other  Transforms 

Although  most  of  this  evaluation  has  focused  on  the  discrete  Fourier  transform,  Spiral  generates 
designs  for  other  transforms  as  well.  This  section  demonstrates  results  for  the  two-dimensional 
discrete  Fourier  transform  DFT  -2D,  the  real  discrete  Fourier  transform  RDFT,  and  DCT-2,  the 
discrete  cosine  transform  of  type  2. 

7.5.1  Two-dimensional  DFT 

As  discussed  in  Chapter  6.2,  the  row-column  algorithm  for  computing  the  two-dimensional  DFT 
consists  of  two  iterative  product  terms,  giving  two  opportunities  for  iterative  reuse:  the  outer  term 
implemented  with  depth  d\,  and  an  inner  term  with  depth  da,  as  illustrated  in  Figure  6.4.  These  two 
degrees  of  freedom  in  iterative  reuse  are  combined  with  freedom  in  the  streaming  width  as  well  as 
the  algorithmic  freedoms  (including  radix),  yielding  a  wide  space  of  designs  to  explore. 

Figure  7.19  illustrates  throughput  versus  area  for  DFT -2Di6xi6  and  DFT -2Dg4X64,  with  16 
bit  fixed  point  data  type  on  Xilinx  Virtex-6  FPGA.  Four  different  data  markers  are  used  to  illustrate 
whether  each  design  employs  iterative  reuse  on  the  inner  and  outer  product. 

The  gray  line  highlights  the  Pareto  optimal  set  of  designs,  which  includes  designs  where:  both 
products  are  iteratively  reused  (diamonds),  only  the  outer  product  is  iteratively  reused  (circles),  and 
neither  product  is  iteratively  reused  (squares).  Although  not  visually  illustrated,  the  Pareto  optimal 
set  also  depends  on  the  freedoms  in  streaming  width,  algorithm,  and  radix  to  span  the  full  tradeoff 
space.  If  any  of  those  freedoms  were  restricted,  the  set  of  Pareto  optimal  points  would  be  reduced. 

7.5.2  Real  Discrete  Fourier  Transform 

Four  algorithms  for  the  real  discrete  Fourier  transform  (RDFT)  are  defined  in  Section  2.3.3.  First 
is  the  “Complex  half-size”  algorithm  (2.19),  which  casts  an  RDFTn  into  a  DFTn//2  with  a  post¬ 
processing  stage.  When  this  algorithm  is  used,  all  of  the  DFT  freedoms  discussed  above  are  relevant. 

Then,  three  native  RDFT  algorithms  are  defined:  a  constant  geometry  algorithm  (2.21),  an  itera¬ 
tive  algorithm  (2.22),  and  a  recursive  algorithm  (2.20).  As  explained  in  Section  6.3,  algorithm  (2.21) 
is  used  when  iterative  reuse  is  requested,  algorithm  (2.22)  is  used  when  no  iterative  reuse  is  utilized, 
and  the  recursive  algorithm  (2.20)  is  used  to  perform  the  basic  blocks. 
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Figure  7.19:  DFT  -2Di6Xi6  and  DFT  -2D64X64,  fixed  point,  throughput  versus  area  on  FPGA. 
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Lastly,  it  is  possible  to  compute  an  RDFTn  using  an  algorithm  for  DFT„  by  simply  utilizing 
the  real  inputs  only.  This  technique  has  higher  algorithmic  cost  than  the  others,  but  can  serve  as  a 
useful  baseline  for  comparison. 

Figure  7.20  shows  throughput  versus  area  for  Virtex-5  FPGA  implementations  of  RDFT128 
and  RDFT2048-  In  both  plots,  the  different  data  markers  represent  the  different  algorithms  listed 
above:  circles  for  the  complex  half-size  algorithm,  diamonds  for  the  native  RDFT  algorithms,  and 
triangles  for  the  complex  DFT  algorithm.  Streaming  widths  are  considered  from  w  =  4  to  w  =  32. 
In  this  experiment,  Spiral’s  memory  options  are  set  to  allow  no  block  RAMs  to  be  utilized  so  that 
any  differences  in  memory  requirements  between  the  algorithms  are  included  within  the  final  area 
count. 

For  both  plots,  the  smallest  point  on  the  Pareto  optimal  front  is  the  point  corresponding  to  the 
complex  DFT  algorithm.  From  there,  however,  the  complex  algorithm  becomes  suboptimal,  due  to 
its  increased  arithmetic  cost.  Then,  as  the  amount  of  area  increases,  the  native  RDFT  algorithms  ap¬ 
pear  on  the  Pareto  front  (diamonds),  followed  by  the  complex  half-size  RDFT  algorithms  (circles). 
The  exact  trend  and  ordering  of  these  points  depends  on  which  radices  are  available  based  upon  the 
problem’s  size. 

7.5.3  Discrete  Cosine  Transform 

This  thesis  also  includes  an  evaluation  of  an  algorithm  (2.23)  for  the  DCT  of  type  2.  Other  similar 
algorithms  suitable  for  hardware  generation  with  Spiral  are  given  in  [14,  16,  15]. 

Figure  7.21  shows  throughput  versus  area  for  DCT-2n  for  n  =  64,256,1024,4096  on  the 
Xilinx  Virtex-6  FPGA.  Each  line  represents  one  value  of  n  with  streaming  width  starting  from 
w  =  4  (bottom-left)  and  increasing  as  the  area/throughput  increases.  All  points  shown  here  are  fully 
streaming  (that  is,  no  iterative  reuse  is  utilized).  Several  designs  are  annotated  with  the  number  of 
BRAMs  and  DSP  slices  consumed  by  each  design. 

7.6  Evaluation  Summary 

This  chapter  provides  an  evaluation  of  FPGA  and  ASIC  implementations  of  transforms  generated  by 
Spiral  using  the  techniques  presented  in  this  thesis.  In  addition  to  the  two  platforms,  the  experiments 
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Figure  7.20:  RDFT128  and  RDFT2048>  fixed  point,  throughput  versus  area  on  FPGA. 
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Figure  7.21 :  DCT-2n,  fixed  point,  throughput  versus  area  on  FPGA.  Data  labels  indicate  the  number 
of  BRAMs  and  DSP  slices. 
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presented  here  include  different  data  types  (fixed  point  and  floating  point)  and  different  transforms 
(DFT,  DFT-2D,  DCT-2,  RDFT). 

Given  a  problem,  a  datatype,  and  a  platform,  Spiral  is  able  to  provide  a  space  of  designs  with 
varying  performance  and  costs.  This  set  of  designs  spans  multiple  algorithmic  options  and  multiple 
ways  of  mapping  from  algorithm  to  datapath.  The  flexibility  that  Spiral  allows  in  these  design 
dimensions  allow  the  user  to  easily  explore  a  wide  space  of  designs  in  order  to  determine  the  one 
that  best  meets  his  or  her  goals. 

Next,  Chapter  8  uses  the  proposed  Spiral  hardware  generation  framework  to  explore  and  imple¬ 
ment  designs  used  in  real-time  transceivers  for  orthogonal  frequency-division  multiplexing  (OFDM) 
for  optical  interconnects.  Then  Chapter  9  discusses  related  work,  and  in  particular  compares  the  de¬ 
signs  considered  here  with  approaches  found  in  the  literature. 


Chapter  8 


Orthogonal  Frequency-Division 
Multiplexing  for  Optical  Networks 


Orthogonal  frequency-division  multiplexing  (OFDM)  is  a  multi-carrier  modulation  technique  that 
is  widely  used  in  communication  applications.  The  most  computationally  expensive  portion  of 
OFDM  is  computing  the  DFT  and  its  inverse.  Recently,  the  optical  communications  community  has 
considered  OFDM  as  a  way  to  reduce  the  cost  of  short  distance  optical  interconnect  (for  example, 
in  data  centers)  or  improve  the  performance  of  long  distance  links. 

This  chapter  describes  an  application  study  where  the  Spiral  hardware  generation  system  is  used 
to  provide  implementations  of  the  DFT  and  IDFT  used  in  optical  OFDM  transmitters  and  receivers. 
This  work  was  published  in  [36,  37,  38,  39]  in  collaboration  with  the  listed  coauthors. 

8.1  FPGA  Prototype 

This  section  describes  the  construction  of  an  FPGA-based  prototype  of  an  optical  transmitter  (de¬ 
tailed  in  [36,  37,  38]).  Figure  8.1  shows  a  high-level  view  of  this  system,  implemented  in  part  on  a 
Xilinx  Virtex-4  FX  100  FPGA.  First,  input  bits  enter  the  system,  and  are  mapped  to  complex  sym¬ 
bols  using  quadrature  phase-shift  keying  (QPSK),  which  maps  two  bits  into  one  of  four  complex 
numbers.  Then,  a  collection  of  symbols  are  fed  as  an  input  vector  into  an  IDFT128  core.  Then,  the 
IDFT  is  computed,  and  the  output  data  travel  out  of  the  FPGA  into  a  4-bit  digital-to-analog  con¬ 
verter,  (running  at  2 1 .4  billion  samples  per  second),  and  into  a  Mach-Zehnder  optical  modulator. 
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Figure  8.1:  OFDM  FPGA  prototype  transmitter. 


Each  output  frame  (128  samples)  encodes  25  QPSK  symbols,  totaling  50  bits.  Thus,  the  system 
transmits  at  a  rate  of  21.4  x  50/128  =  8.35  gigabits  per  second.  The  transmitter’s  bit  error  rate  is 
less  than  10~3. 

The  remainder  of  this  section  describes  how  the  Spiral  hardware  generation  tool  was  used  in 
the  exploration  and  implementation  of  the  IDFT128  needed  within  this  transmitter.  The  IDFT  core 
operates  on  hxed-point  complex  data  and  must  appropriately  scale  the  data  to  guarantee  that  over¬ 
flow  will  not  occur.  The  implementation  must  support  a  throughput  of  one  transform  per  cycle  (128 
complex  samples  per  cycle)  at  a  frequency  of  167.25  MHz.  This  gives  an  overall  throughput  re¬ 
quirement  of  167.25  million  transforms  per  second  or  21.4  billion  samples  per  second.  Typically, 
hardware  implementations  of  the  IDFT  utilize  sequential  reuse,  as  described  in  Chapter  3.  However, 
in  the  proposed  application,  all  128  data  elements  must  be  processed  in  parallel  (as  opposed  to  a 
few  at  a  time)  in  order  to  meet  the  throughput  requirement.  Therefore,  a  fully  unrolled  datapath 
(one  that  does  not  include  any  sequential  reuse)  is  feasible.  Such  a  datapath  requires  a  large  number 
of  computational  elements,  but  allows  additional  freedom  in  the  algorithm  selection.  This  design 
study  includes  evaluation  of  automatically  generated  IDFT  cores  derived  from  two  families  of  IFFT 
algorithms:  the  Pease  algorithm  (2.13),  which  is  regular  and  simple,  and  algorithms  based  on  the 
Cooley-Tukey  algorithm  (2.12),  which  has  a  less  regular  structure,  but  when  higher  radices  are  used, 
lower  computational  cost. 
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Figure  8.2:  Area  versus  fixed-point  precision  of  IDFT  cores  for  OFDM  on  FPGA. 

Pease  IFFT  Algorithm.  The  radix  2  Pease  IFFT  algorithm  has  a  highly  regular  structure  and  is 
frequently  used  in  hardware  implementations  of  the  IFFT.  The  algorithm  (for  n  data  points)  consists 
of  log2(n)  stages,  each  performing  n/2  “butterflies”  or  basic  blocks  that  perform  one  addition, 
one  subtraction,  and  one  complex  multiplication.  A  generated  fully  unfolded  implementation  for 
n  =  128  requires  2308  adders/subtractors  and  908  multipliers  (each  operating  on  real  values). 
The  tool’s  automatic  pipelining  results  in  36  stages.  Figure  8.2  (black  circles)  plots  the  number 
of  FPGA  slices  (reconfigurable  logic  elements)  required  against  the  fixed-point  precision  (in  bits). 
Each  design  is  synthesized  and  plate/routed  using  the  Xilinx  Integrated  Software  Environment  (ISE) 
suite  of  FPGA  tools.  Using  this  algorithm,  only  the  4-,  6-,  and  8-bit  designs  fit  on  the  target  FPGA. 

Searching  across  a  family  of  algorithms.  Next,  a  wider  set  of  IFFT  algorithms  are  considered, 
based  on  (2.12),  which  is  recursive.  Using  Spiral,  many  different  breakdowns  of  these  algorithms 
are  generated  and  flattened  into  an  unrolled  structure.  Then,  search  is  employed  to  identify  the 
algorithm  with  the  lowest  algorithmic  cost.  For  this  system,  the  best  algorithm  found  is  a  mix  of 
radix  8  and  radix  16  IFFTs,  resulting  in  2192  adders/subtractors  and  664  multipliers  (a  reduction 
of  more  than  25%  of  the  multipliers  used  in  the  Pease  implementation).  Again,  designs  based  on 
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Mean  RMS  Error 

Std.  Dev.  RMS  Error 
Mean  SNR 

Std.  Dev.  SNR 

6.6  x  10“2 

2.2  x  10“3 
8.9  x  10-2 

6.2  x  10-3 

1.9  x  10-2 
5.4  x  10"4 
1.3 

7.8  x  1CT2 

4.9  x  10“3 
1.3  x  10-4 
2.0  x  101 

1.1 

1.2  x  10"3 
3.6  x  10"5 

3.3  x  102 

1.9  x  101 

3.0  x  10“4 
8.6  x  10“6 
5.3  x  103 

3.0  x  102 

7.5  x  10-5 
2.2  x  10-6 
8.8  x  104 
5.1  x  103 

Table  8.1:  RMS  error  and  SNR  for  IDFT128  designs. 


this  algorithm  are  generated  for  various  values  of  fixed-point  precision.  Figure  8.2  plots  the  number 
of  slices  needed  for  each  design  (as  black  triangles).  Designs  based  on  this  algorithm  require  on 
average  37%  fewer  slices  than  their  Pease  IFFT  counterparts.  Designs  of  up  to  12  bits  of  precision 
are  able  to  fit  on  the  target  FPGA,  an  improvement  of  four  bits. 

Numerical  Accuracy.  Table  8.1  presents  an  analysis  of  the  root-mean-squared  error  (RMSE) 
and  SNR  for  the  IDFT  generated  designs.  For  each,  512  IDFT128  computations  are  performed 
on  OFDM  input  symbols,  and  RMSE  and  SNR  are  calculated  for  each  vector.  For  each  design, 
Table  8.1  reports  the  mean  and  standard  deviation  of  the  512  values  for  RMSE  and  SNR,  calculated 
as  follows,  where  S  is  the  measured  output  vector  and  T  is  the  expected  output  vector: 


RMSE 


N-l 


\Sn~Tn\> 

N 


SNR 


v-'A  — 1  \T  12 
l^  n= 0 


Z_-m=0 


|  Sn 


T  2 

n 


A  two  bit  increase  in  fixed-point  precision  causes  the  maximum  magnitude  of  each  data  word 
to  increase  by  a  factor  of  4.  Table  8.1  illustrates  that  such  an  increase  yields  a  decrease  in  RMSE  of 
approximately  a  factor  of  4  and  an  increase  in  SNR  of  approximately  a  factor  of  42  =  16. 

Pruning.  Depending  on  which  subcarriers  are  utilized,  a  number  of  IFFT  inputs  will  always 
be  zero.  Because  this  study  considers  fully  unrolled  designs,  this  makes  it  possible  to  prune  out  a 
number  of  unnecessary  operations  from  the  early  stages  of  the  IFFT,  resulting  in  a  modest  decrease 
in  required  area.  This  is  done  by  adding  a  wrapper  around  the  IFFT  core  and  allowing  the  synthesis 
tool  to  perform  simplification.  In  Figure  8.2  (white  data  markers),  this  technique  is  evaluated  on 
both  IFFT  algorithms  discussed  earlier,  assuming  that  only  a  quarter  of  the  IFFT  inputs  are  nonzero. 
On  average,  this  results  in  an  8.3%  decrease  in  the  number  of  slices. 


8.2.  ASIC  Design  Study 
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8.2  ASIC  Design  Study 

Further  work  in  [39]  presents  a  design  study  of  a  higher-throughput  OFDM  transceiver  synthesized 
as  an  ASIC.  This  transceiver  assumes  signal  converters  (digital-to-analog  and  analog-to-digital  con¬ 
verters)  that  process  28  billion  samples  per  second,  with  6  bit  resolution  on  the  D/A  and  8  bit 
resolution  on  the  A/D.  This  higher  precision  allows  more  channels  of  the  transmitted  signal  to  carry 
data,  allowing  a  higher  overall  data  rate.  The  system  encodes  100  bits  per  128  transmitted  samples, 
giving  an  overall  data  rate  of  28  x  100/128  =  21.875  gigabits  per  second.  The  paper  [39]  then  per¬ 
forms  simulation  to  determine  the  fixed  point  precision  needed:  10  bits  for  the  transmitter  (which 
includes  IDFT^s)  and  14  bits  for  the  receiver  (which  includes  DFTi28- 

Next,  Spiral  generates  twenty  different  IDFT  and  DFT  implementations  (with  10  bit  and  14 
bit  fixed  point  data  types,  respectively).  All  designs  are  based  on  the  Iterative  Cooley-Tukey  FFT 
algorithm  with  varying  radices  (2.15).  The  designs  considered  process  either  32,  64,  or  128  samples 
per  cycle,  and  thus  must  be  clocked  at  875,  437.5,  or  218.75  MHz  (respectively)  in  order  to  meet 
the  throughput  requirement  of  28  Gsamples/s.  To  meet  the  OFDM  performance  target,  each  design 
must  perform  approximately  1012  fixed  point  operations  per  second.  The  designs  that  process  more 
samples  per  cycle  have  a  higher  area  but  are  clocked  at  a  lower  frequency  and  thus  consume  less 
power.  This  allows  a  tradeoff  between  power  consumption  and  area. 

Each  design  is  synthesized  using  Synopsys  Design  Compiler  Ultra  targeting  a  commercial  65nm 
standard  cell  library.  Transceivers  are  constructed  from  each  generated  IDFT  and  DFT  implemen¬ 
tation  by  integrating  appropriate  modules  for  QPSK  mapping,  de-mapping,  and  clipping/scaling. 
Then,  each  is  synthesized  and  its  frequency,  area,  and  power  are  estimated.  Figure  8.3  shows  De¬ 
sign  Compiler’s  reported  area  and  power  consumption  for  each  design.  Each  point  on  the  graph 
indicates  the  power  (y-axis)  and  area  (x-axis)  of  a  single  implementation  of  the  transmitter  or  re¬ 
ceiver.  The  solid  lines  indicate  the  Pareto-optimal  set  of  designs  (the  set  of  designs  that  give  the 
best  tradeoff  between  power  and  area).  The  transmitter’s  Pareto-optimal  set  contains  five  designs, 
ranging  from  0.72  mm2  and  0.44  W  to  0.95  mm2  and  0.17  W.  The  receiver’s  Pareto-optimal  set 
contains  three  designs  ranging  from  1.14  mm2  and  0.73  W  to  1.35  mm2  and  0.24  W. 

Typically,  power  is  the  most  critical  constraint  within  the  system,  so  one  would  choose  the 
lowest/rightmost  design  from  the  Pareto-optimal  set.  However,  if  the  system  area  is  constrained, 
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OFDM  Tx  and  Rx,  ASIC  Synthesis  (65nm) 
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Figure  8.3:  Power  versus  area  for  OFDM  transmitters  and  receivers. 


a  smaller  (yet  higher  power)  design  could  be  selected.  The  most  power-efficient  transmitter  con¬ 
sumes  7.6  mW/Gb/s,  while  the  most  power-efficient  receiver  consumes  10.7  mW/Gb/s.  In  both  the 
transmitter  and  receiver,  the  most  power-efficient  designs  process  128  samples  per  cycle  at  218.75 
MHz,  while  the  most  area-efficient  designs  process  32  samples  per  cycle  at  875  MHz.  For  these 
architectures,  the  mixed  radix  Cooley-Tukey  FFT  algorithm  has  lowest  cost  when  radices  8  and  16 
are  used.  Simpler  algorithms,  such  as  the  commonly-seen  radix  2  FFT  have  higher  computational 
cost  and  correspond  to  the  least  efficient  points  seen  on  the  graph. 


Chapter  9 


Related  Work 

9.1  Formula-based  Hardware  Representation 

The  tensor  formula  language  used  in  this  thesis  has  been  previously  used  as  a  way  to  describe 
transform  algorithms.  For  example  [1,  7,  4]  use  this  type  of  representation  to  discuss  software 
implementations  of  the  DFT. 

Although  the  work  presented  in  this  thesis  (published  in  part  in  [17])  was  the  first  to  extend  the 
tensor  formula  language  to  support  a  general  class  of  hardware  implementations,  variants  of  this 
mathematical  language  have  been  used  in  the  process  of  constructing  special  purpose  hardware  for 
the  DFT.  For  example,  [40]  uses  the  tensor  product-based  mathematical  language  in  the  process  of 
designing  a  universal  DFT  processor  that  is  scalable  in  the  number  of  processing  elements.  More 
recently,  [41]  studies  pipelined  architectures,  using  the  tensor  language  to  describe  a  family  of  FFT 
algorithms  and  providing  a  hardware  interpretation  of  certain  formula  constructs.  For  the  discrete 
cosine  transform,  [14,  15,  16]  use  the  tensor  formula  representation  as  a  way  to  algorithms  suitable 
for  pipelined  hardware  implementation. 

9.2  Hardware  Implementation  of  Transforms 

This  section  gives  an  overview  of  work  in  the  literature  on  hardware  architectures  for  transforms. 
First,  Sections  9.2.1,  9.2.2,  and  9.2.3  discuss  fully-folded,  pipelined,  and  other  flexible  implementa¬ 
tions  of  the  DFT.  Section  9.2.4  examines  work  that  eliminates  or  reduces  the  number  of  multipliers 
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needed  in  a  DFT  implementation,  and  Section  9.2.5  discusses  systolic  array  architectures  for  the 
DFT.  Then,  Sections  9.2.6  and  9.2.7  show  examples  of  recent  work  on  non-power-of-two  sized 
DFTs  and  other  transforms. 

9.2.1  Fully-Folded  DFT  Processors 

The  simplest  hardware  implementations  of  the  discrete  Fourier  transform  are  fully-folded  architec¬ 
tures  that  use  a  small  processing  element  sequentially  over  the  course  of  the  transform  computation. 
The  cost  and  performance  of  these  fully-folded  processors  is  closest  to  the  designs  in  this  thesis 
with  full  streaming  and  iterative  reuse. 

An  early  example  can  be  found  in  [42],  which  consists  of  a  complex  butterfly  and  multiplier, 
two  memories,  and  control  logic.  This  idea  is  extended  in  [43],  which  implements  a  radix  r  FFT 
algorithm  on  a  system  with  one  DFTr  basic  block  and  2 r  memories. 

The  basic  block  in  this  type  of  system  is  improved  in  [44],  which  partitions  the  data  into  two 
memory  banks.  Then,  two  data  can  be  read,  pass  through  a  butterfly  unit,  and  be  written  on  every 
cycle.  The  address  and  bank  locations  are  computed  using  inexpensive  bit  arithmetic  on  the  vector 
element  number.  In  this  way,  this  work  can  be  viewed  as  a  predecessor  of  the  streaming  permutation 
implementations  in  [20]. 

More  recently,  [45]  partitions  the  memory  in  a  system  of  this  type  into  a  small  cache  and  a  larger 
main  memory.  Then,  the  FFT  factorization  is  chosen  to  make  best  use  of  the  data  in  the  local  cache. 

Architectures  of  this  type  have  also  been  implemented  on  FPGA.  For  example,  [46]  uses  two 
radix  4  processing  elements  and  three  banks  of  memory  on  a  Xilinx  Virtex-II  FPGA. 

9.2.2  Pipelined  DFT  Implementations 

A  common  structure  for  implementing  the  DFT  is  the  pipelined  (or  streaming)  architecture.  These 
designs  are  popular  because  they  are  relatively  easy  to  implement  and  typically  obtain  much  higher 
performance  than  the  fully-folded  processors  discussed  above.  Additionally,  designs  of  this  type  are 
easy  to  integrate  into  surrounding  systems;  typically  they  take  in  and  emit  one  complex  word  per 
cycle.  This  type  of  architecture  corresponds  to  designs  with  no  iterative  reuse  and  full  streaming 
reuse  (width  w  =  1).  Although  this  is  narrower  than  the  smallest  width  considered  in  the  designs 
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architecture 

throughput 

multipliers 

mult. 

adders 

adder 

apx.  mem. 

(words  /  eye.) 

util. 

util. 

size 

R2SDF 

1 

log2  n  -  2 

50% 

2  log2  n 

50% 

n 

R2MDC 

1 

log2  n  -  2 

50% 

2  log2  n 

50% 

1.5  n 

R2MDC-mod 

2 

log2  n-  2 

100% 

2  log2  n 

100% 

9.5  n 

Spiral  radix  2 

w 

(ut/2)(log2n-2) 

100% 

w  log2  n 

100% 

6  n 

R4SDF 

1 

log4  n—  1 

75% 

8  log4  n 

25% 

n 

R4MDC 

1 

3(log4  n  -  1) 

25% 

8  log4  n 

25% 

2.5  n 

R4MDC-mod 

4 

3(log4  n  -  1) 

100% 

8  log4  n 

100% 

18.5  n 

R4SDC 

1 

log4  n  —  1 

75% 

3  log4  n 

75% 

2  n 

R22SDF 

1 

log4  n  -  1 

75% 

4  log4  n 

50% 

n 

Spiral  radix  4 

w 

(3u>/ 4)(log4  n  —  1) 

100% 

2  w  log4  n 

100% 

4.67  n 

All  multipliers  and  adders  operate  on  complex  data,  and  all  implementations  have  bit/digit-reversed  output. 

Table  9.1:  Summary  of  pipelined  DFT  architectures. 


of  this  thesis,  typically  these  designs  sacrifice  performance/cost  scalability.  That  is,  most  can  be 
thought  of  as  architectures  for  the  special  case  of  w  =  1. 

There  are  two  common  types  of  pipelined  DFT  architectures:  single-path  delay-feedback  (ab¬ 
breviated  SDF)  and  single-path  or  multi-path  delay-commutator  (abbreviated  SDC  and  MDC),  us¬ 
ing  the  terminology  from  [47].  In  each  case,  delay  elements  and  switches  are  used  to  buffer  and 
reorder  the  data  streams  as  required  by  the  algorithm.  Table  9.1  summarizes  the  costs  and  through¬ 
put  of  different  versions  of  these  designs  and  compares  them  with  the  closest  designs  from  the 
framework  described  in  this  thesis.  In  all  cases,  the  designs  in  Table  9.1  take  data  in  natural  input 
order  and  produce  it  in  radix  2  or  4  digit  reversed  order  (that  is,  reordered  by  Ilf). 

Early  examples  of  SDF  architectures  can  be  found  in  [48],  which  implements  a  simple  radix-2 
pipeline,  and  in  [49]  which  utilizes  a  similar  structure,  but  reduces  the  amount  of  memory  required 
(see  line  labeled  “R2SDF”  in  Table  9.1). 

The  MDC  architecture  is  introduced  in  [50]  has  a  higher  memory  cost  with  the  same  throughput 
for  one  DFT„  (line  “R2MDC”  in  Table  9.1).  However,  this  technique  is  able  to  compute  two  inter¬ 
leaved  DFTs  concurrently  with  double  the  throughput.  In  other  words,  it  can  compute  DFTn  (g>/2 
with  a  throughput  of  two  words  per  cycle  and  no  additional  overhead.  In  order  to  convert  this  into 
a  traditional  throughput-driven  streaming  design  like  those  considered  in  this  thesis,  the  expression 
can  be  rewritten 

DFTn  ®I2  -  L2nn  •  (I2  ®  DFTn)  •  L2n. 


The  streaming  stride  permutations  L  can  then  be  computed  using  the  bit-matrix-based  streaming 
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permutation  method  described  in  Chapter  5.2.  This  design,  with  added  overhead  for  the  L  per¬ 
mutations,  is  given  in  line  “R2MDC-mod”  in  Table  9.1).  Although  this  allows  speedup  with  no 
arithmetic  cost,  the  memory  cost  is  substantial.  In  the  literature,  this  technique  is  not  considered  for 
this  architecture;  existing  papers  only  consider  this  design  for  the  interleaved  case. 

These  techniques  can  also  be  extended  to  higher  radices,  as  shown  in  the  table’s  “R4SDF”  and 
“R4MDC”  lines.  As  shown,  the  radix  4  SDF  architecture  reduces  the  number  of  multipliers  while 
increasing  the  number  of  adders.  The  radix  4  MDC  architecture  [51]  has  higher  costs,  but  again  can 
be  used  with  interleaved  data  (computing  DFTn  181/4)  which  can  be  rewritten  as  in  the  radix  2  case 
above  (indicated  “R4MDC-mod”  here). 

Further  work  then  expands  on  these  designs.  In  [52],  the  commutator  designs  of  [50,  51]  are 
simplified,  producing  the  radix  4  SDC  design  “R4SDC”.  This  design  reduces  both  arithmetic  and 
memory  requirements,  but  it  does  not  have  the  property  of  the  radix  r  MDC  architecture  that  allows 
it  to  compute  DFTn  0/r  with  no  additional  hardware  cost. 

He  and  Torkelson  [47]  summarize  various  types  of  FFT  pipeline  processor  architectures  and 
propose  a  “radix  22”  single -path  delay  feedback  (SDF)  architecture  that  combines  the  processing 
element  structure  of  earlier  radix  2  designs  with  the  higher-level  implementation  of  a  radix  4  algo¬ 
rithm.  This  algorithm  and  datapath  arc  used  by  Xilinx  in  the  LogiCore  FFT  IP  that  this  thesis  uses 
as  a  benchmark  in  Section  7.3.  Essentially,  [47]  maps  a  radix  4  algorithm  onto  an  SDF  pipeline 
with  basic  blocks  that  are  more  similar  to  radix  2  basic  blocks  than  previous  techniques  had  used. 
In  a  straightforward  implementation  of  an  SDF  radix  4  algorithm,  there  would  be  added  complexity 
because  the  algorithm  is  not  mapped  well  to  a  hardware  structure  that  matches  the  overall  pipeline. 
This  distinction  between  radix  2  and  radix  4  basic  blocks  (and  their  utilization)  is  only  relevant  in 
this  manner  type  of  limited-utilization  pipeline.  The  drawbacks  of  going  to  higher  radices  described 
in  [47]  are  not  encountered  when  using  the  techniques  proposed  in  this  thesis. 

Each  of  these  architectures  has  throughput  of  1,  2  or  4  complex  words  per  cycle.  In  contrast, 
the  fully-streaming  designs  from  this  thesis  have  a  throughput  of  w  per  cycle  (w  =  2k,  w  \  n,  and 
w  >  r). 

To  better  illustrate  the  comparison  between  the  different  pipelined  architectures,  Table  9.2  nor¬ 
malizes  the  throughput,  multipliers,  adders,  and  approximate  memory  for  the  considered  designs. 

The  designs  built  generated  by  Spiral  compare  well  with  the  existing  designs.  Relative  to  the 
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architecture 

throughput 

mults 

adders 

apx.  memory 

R2SDF 

1 

2 

1.33 

i 

R2MDC 

1 

2 

1.33 

1.5 

R2MDC-mod 

2 

2 

1.33 

9.5 

Spiral  radix  2 

w 

w 

2w/3 

6 

R4SDF 

1 

1 

2.67 

1 

R4MDC 

1 

3 

2.67 

2.5 

R4MDC-mod 

4 

3 

2.67 

18.5 

R4SDC 

1 

1 

1 

2 

R22SDF 

1 

1 

1.33 

1 

Spiral  radix  4 

w 

3w/4 

2w/3 

4.67 

Table  9.2:  Summary  of  pipelined  DFT  architectures,  normalized. 

R2SDF  architecture,  the  proposed  radix  2  designs  have  w  x  higher  throughput  with  w/2  x  the  num¬ 
ber  of  adders,  w/ 2x  the  number  of  multipliers,  and  6x  the  number  of  memories.  So,  for  w  >  I,  the 
proposed  method  yields  superlinear  speedup  in  all  metrics.  For  w  <  4,  it  yields  superlinear  speedup 
except  with  respect  to  memory  required.  Relative  to  R2MDC-mod,  the  proposed  radix  2  design  has 
w/2x  the  speedup  with  w / 2  x  the  multiplier  and  adder  cost  at  63%  of  the  memory  requirement. 

For  radix  4,  Table  9.2  shows  similar  comparisons.  Compared  to  the  R22SDF  architecture,  the 
proposed  design  has  a  speedup  of  w  with  3m/4x  multipliers,  w/2x  adders,  and  4.67  times  the 
memory.  So,  for  w  >  4,  the  proposed  design  yields  superlinear  speedup  in  all  metrics.  For  w  =  4, 
it  yields  superlinear  speedup  except  with  respect  to  memories. 

Lastly,  [41]  presents  a  technique  (related  to  the  tensor  formula  language  used  in  this  thesis)  to 
represent  arbitrary  radix  SDC  and  SDF  datapaths.  These  techniques  could  be  incorporated  into  the 
generation  system  considered  in  this  thesis. 

9.2.3  Increasing  Flexibility 

In  addition  to  the  sequential  processors  and  narrow  pipelines  discussed  above,  some  work  in  the 
literature  has  allowed  more  freedom  in  the  width  and  depth  of  a  datapath. 

For  example,  [53]  includes  discussion  of  a  parallel  iterative  architecture  that  consists  of  one 
fully  parallel  stage  of  the  FFT  with  full  feedback.  That  is,  full  iterative  reuse  (depth  d  =  1)  and  no 
streaming  reuse. 

Another  type  of  design  is  shown  in  [54],  which  maps  the  Pease  FFT  algorithm  onto  a  rectangular 
array  processors  consisting  of  one  column  of  processing  elements  connected  with  an  interconnection 
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network  that  performs  L”  ,T  Other  work  [55,  56]  are  similar  ([56]  considers  arbitrary  radix  r). 

Wider  versions  of  pipeline  architectures  are  considered  in  [23],  which  presents  a  method  to  take 
a  standard  pipelined  FFT  implementation  and  increase  its  streaming  width  resulting  in  what  they  call 
a  “partial  column”  implementation.  With  this  approach,  the  difficult  problem  becomes  performing 
streaming  stride  permutations  with  non-trivial  streaming  width.  In  this  work,  a  technique  with 
memories  and  interconnection  networks  is  provided  to  perform  with  streaming  width  2k  where 
k  >  n  —  s.  As  shown  in  [20],  the  connection  networks  used  in  [23] ’s  solution  are  more  expensive 
than  those  used  in  the  bit-matrix  streaming  permutation  method  (see  Chapter  5.2). 

Reference  x[40]  implements  the  Dimensionless  FFT  algorithm,  which  allows  multiple  dimen¬ 
sions  of  DFT  to  be  computed  only  by  changing  the  constants  stored  in  diagonal  matrices.  For 
example,  one  implementation  can  compute  DFTi6,  (DFT2  <g>  DFTs),  (DFT40DFT4),  and  so 
on.  The  architecture  considered  here  is  similar  to  [54],  with  a  set  of  processors  in  parallel  connected 
via  interconnection  network. 

Reference  [57]  studies  the  energy-efficiency  of  FFT  implementations  on  a  Xilinx  Virtex-II  Pro 
FPGA,  considering  pipelined  and  iterative  reuse  designs.  In  [58],  floating  point  implementations 
on  FPGA  are  considered;  all  designs  examined  use  streaming  reuse  of  parameterizable  width,  and 
iterative  reuse  with  depth  1,  2,  or  log2  n  (that  is,  no  iterative  reuse). 

Other  recent  works  [59]  and  [60]  use  high-level  synthesis  techniques  to  get  architectural  flexi¬ 
bility  from  one  high-level  description.  These  are  discussed  below  in  Chapter  9.3. 

9.2.4  Multiplierless  Implementations 

Some  prior  work  has  focused  on  avoiding  multiplication  within  hardware  implementations  of  the 
DFT.  For  example,  [61],  [62],  and  [63]  replace  the  complex  multipliers  in  the  SDF  pipelined  ar¬ 
chitectures  and  parallel  pipeline  architectures  (as  discussed  above)  with  CORDIC  complex  rotators. 
This  eliminates  the  need  for  complex  multipliers  and  lookup  tables.  More  recently,  [64]  revisited 
these  techniques  and  reduced  the  cost  of  the  rotator  unit.  Replacing  multipliers  with  CORDIC  ro¬ 
tators  is  orthogonal  to  the  parameters  studied  in  this  thesis;  such  techniques  could  be  applied  as  an 
alternate  method  for  implementing  the  diagonal  matrices  that  contain  twiddle  factors. 

Other  work  has  focused  on  reducing  the  number  of  multipliers  in  other  ways.  In  [65],  a  nearly 
multiplierless  implementation  of  an  FFT  is  proposed  that  reformulates  the  transform  as  a  cyclic 


9.2.  Hardware  Implementation  of  Transforms 


122 


convolution.  The  result  is  a  width  w  =  1  pipeline  that  requires  only  two  multipliers  but  3n/4  +  13 
adders  to  compute  DFTn.  In  practice,  this  technique  is  only  feasible  for  small  problem  sizes.  At 
n  =  1024,  [65]  requires  2  multipliers  and  781  adders  (versus  8  multipliers  and  20  adders  for  the 
simple  radix  2  SDF  datapath  [49]  discussed  above). 

A  benefit  of  the  technique  in  [65]  is  increased  transform  accuracy  for  the  same  fixed  point 
input/output  precision.  For  example,  the  authors  show  that  a  standard  multiplier-based  pipeline 
needs  28  bits  to  match  the  precision  of  their  proposed  22  bit  design.  They  further  show  that  for 
small  sizes  (n  =  16,  32,  64),  their  technique  requires  less  resources  than  the  R2SDF  pipeline. 

9.2.5  Systolic  Arrays 

The  DFT  has  also  been  implemented  using  systolic  arrays.  Often,  these  implementations  use 
0(n2)  algorithms  that  compute  the  DFT  as  a  matrix- vector  multiplication  or  other  similar  tech¬ 
nique  [66,  67,  68].  The  advantage  of  a  systolic  array  is  that  it  is  parallel  and  regular;  the  algorithms 
used  in  these  papers  map  well  to  this  type  of  system.  However,  this  comes  at  the  cost  of  higher 
arithmetic  cost  (number  of  operations  per  transform).  This  disparity  grows  greatly  as  the  problem 
size  increases. 

Another  approach  [69]  combines  an  0(n  log  n)  FFT  algorithm  at  the  highest  level,  with  a  lower- 
level  systolic  array  to  compute  the  basic  blocks.  In  the  end,  techniques  like  this  end  up  structurally 
similar  to  other  DFT  architectures  such  as  those  discussed  above. 

9.2.6  Non-two-power  sizes 

Most  previous  work  on  hardware  implementations  of  non-power-of-two  sized  DFTs  has  focused  on 
producing  a  solution  for  a  specific  situation  (a  given  problem  size  and  performance  requirement). 
For  example,  [70]  and  [71]  consider  the  specific  problem  sizes  and  performance  requirements  of  the 
Digital  Radio  Mondiale  (DRM)  radio  broadcasting  standard,  while  [72]  presents  an  FPGA-based 
pipeline  design  of  a  3,780  point  inverse  DFT  used  in  a  digital  television  standard. 

A  more  flexible  solution  is  given  in  [73],  which  uses  a  convolution-based  approach  to  provide 
an  algorithm  and  architecture  for  computing  the  DFT  when  the  problem  size  is  a  prime  number  or  a 
product  of  primes. 


123 


Chapter  9.  Related  Work 


9.2.7  Other  Transforms 

Similar  implementations  of  other  transforms  also  appear  in  the  literature.  For  example,  [74]  maps 
an  RDFT  algorithm  to  a  datapath  similar  to  the  SDF  pipelines  commonly  seen  for  the  DFT. 

Another  RDFT  implementation  appears  in  [75],  where  an  algorithm  similar  to  the  complex  half¬ 
size  RDFT  algorithm  (2.19)  is  used  to  compute  RDFT„  using  DFTn/2  and  post-processing.  This 
paper  then  computes  DFTn/2  using  a  fully-folded  DFT  processor  similar  to  the  ones  discussed  in 
Section  9.2.1. 

Other  papers  explore  these  types  of  implementations  of  discrete  cosine  and  sine  transforms 
(DCTs  and  DSTs).  For  example,  [15]  derives  constant  geometry  algorithms  for  DCTs  and  DSTs 
that  are  similar  in  structure  to  the  Pease  FFT  (2.13).  In  [16],  this  algorithm  is  simplified  for  the 
special  case  of  DCT-2.  Later,  [14]  extends  these  ideas  further  to  enable  pipelined  architectures  for 
DCTs  and  DSTs  of  type  2,  3,  and  4.  This  is  the  source  of  the  DCT-2  algorithm  presented  in  this 
thesis  as  (2.23)  and  evaluated  in  Section  7.5.3. 

9.3  High-Level  Synthesis 

High-level  synthesis  (HLS)  is  an  automated  process  that  takes  as  input  a  behavioral  description  of 
a  digital  system  and  produces  a  register-transfer  level  implementation.  By  separating  algorithm 
specification  from  implementation  issues,  HLS  aims  to  ease  the  burden  on  designers  by  allowing 
them  to  ignore  lower-level  implementation  issues.  Further,  by  allowing  multiple  datapaths  to  be 
built  from  the  same  specification,  HLS  aims  to  lower  the  cost  of  design  space  exploration.  A  recent 
overview  of  HLS  can  be  found  in  [76]. 

Many  commercial  products  for  high-level  synthesis  are  available,  based  on  different  input  lan¬ 
guages.  Many  are  based  on  subsets  of  C,  C++,  or  SystemC,  including  Catapult  C  Synthesis  (Mentor 
Graphics),  Handel-C  (Mentor  Graphics),  Synphony  (Synopsys),  Cadence  C-to-Silicon  Compiler, 
and  Cynthesize  (Forte  Design  Systems).  Bluespec  Compiler  is  based  on  Bluespec  SystemVerilog, 
which  extends  a  synthesizable  subset  of  SystemVerilog  to  include  atomic  transactions.  Lastly,  other 
HLS  tools  are  based  on  graphical  environments,  such  as  NI  Lab  VIEW  (National  Instruments)  and 
Simulink  HDL  Coder  (Math Works). 

Some  recent  papers  have  explored  the  benefits  of  using  HLS  to  implement  the  DFT  in  FPGA 
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or  ASIC.  First,  [77]  implements  the  radix  22  SDF  structure  from  [47]  on  an  FPGA  using  a  C-based 
high-level  synthesis  language  called  Handel-C.  This  design  is  parameterizable  in  the  number  of 
bits  to  use  for  data  and  twiddle  factors,  the  transform  size  (number  of  points),  and  scaling  factors. 
However,  it  does  not  consider  any  architectural  tradeoffs. 

Reference  [59]  presents  an  ASIC  transmitter  for  the  802.1  la  wireless  standard  written  in  Blue- 
spec  System Verilog.  The  most  computationally  expensive  portion  of  this  transmitter  is  the  IDFT64- 
This  implementation  is  of  the  radix-2  Cooley-Tukey  FFT,  and  it  uses  the  capabilities  of  the  HLS 
language  to  allow  the  design  to  be  parameterizable  in  streaming  width  and  iterative  reuse  depth. 
This  paper  also  examines  the  tradeoff  between  area  and  power  at  a  fixed  performance  target.  Simi¬ 
lar  to  the  evaluation  this  thesis  performs  in  Section  7.4.3,  [59]  obtains  a  tradeoff  between  area  and 
power  at  a  fixed  performance  target. 

Similarly,  [60]  targets  the  DFT  on  an  FPGA,  expressing  a  radix  2  FFT  algorithm  as  a  double 
loop  using  National  Instruments  Lab  VIEW.  Each  loop  can  then  be  unrolled  by  the  tool.  The  outer 
loop  corresponds  to  the  FFT’s  stages;  unrolling  corresponds  to  increasing  the  depth  of  the  imple¬ 
mentation.  Similarly,  the  inner  loop  corresponds  to  the  individual  processing  elements  within  the 
stage;  unrolling  this  loop  corresponds  to  increasing  the  streaming  width. 

These  papers  present  a  glimpse  of  the  strength  of  high-level  synthesis:  by  expressing  an  al¬ 
gorithm  at  a  higher  level  of  abstraction,  the  tool  is  able  to  aid  in  the  low-level  design  of  multi¬ 
ple  architectures.  However,  the  HLS  tools  are  not  well-equipped  to  handle  the  algorithmic-level 
exploration  and  optimizations  made  possible  by  Spiral’s  ability  to  manipulate  algorithms  and  ar¬ 
chitectures  within  the  same  mathematical  language.  For  example,  although  Lab  VIEW  can  manipu¬ 
late  [60] ’s  radix  2  FFT  algorithm  into  designs  with  different  width  and  depth,  it  is  not  able  to  convert 
to  a  radix  4  algorithm  or  a  mixed  radix  implementation. 

9.4  Streaming  Permutations 

The  work  on  streaming  permutations  presented  in  Chapter  5  is  unique  in  that  the  methods  pre¬ 
sented  are  the  only  ones  that  design  RAM-based  datapaths  for  a  generalized  family  of  streaming 
permutations  ([19]  is  applicable  to  any  streaming  permutation,  and  [20]  can  apply  to  an  important 
subset).  The  only  other  fully-general  streaming  permutation  method  is  [25],  which  builds  streaming 
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permutation  structures  using  a  register  allocation  method,  resulting  in  a  large  number  of  individual 
registers  connected  with  switches  or  multiplexers.  Others  consider  just  a  small  subset  of  streaming 
permutations:  in  [78,  79,  22],  a  method  based  on  FIFOs  and  switches  is  used  to  build  streaming 
stride  permutations  ( L  in  this  thesis)  with  two-power  size.  In  [23],  [80],  and  [81],  subsets  of  stream¬ 
ing  stride  permutations  are  considered  in  the  context  of  the  DFT. 


Chapter  10 


Conclusions  and  Future  Work 

10.1  Overview 

This  thesis  presented  the  Spiral  hardware  generation  framework,  an  automated  system  based  on  a 
mathematical  language  that  produces  hardware  cores  for  computing  linear  signal  processing  trans¬ 
forms.  This  system  automatically  implements  hardware  designs  across  a  wide  space  of  cost/perfor¬ 
mance  tradeoffs,  allowing  the  user  to  choose  the  implementation  that  best  fits  his  or  her  application. 
The  key  to  this  system  lies  in  the  mathematical  expression  of  relevant  degrees  of  freedom,  both  in 
the  space  of  transform  algorithms  and  in  the  space  of  sequential  datapaths. 

Chapter  2  presents  relevant  background  on  linear  transforms  and  their  algorithms.  Transforms 
are  represented  as  matrix-vector  products,  and  algorithms  are  decompositions  of  transform  matrices 
into  products  of  structured  sparse  matrices.  These  decompositions  can  be  written  using  a  mathe¬ 
matical  language  based  on  linear  algebra. 

Chapter  3  shows  how  structure  within  transform  algorithms  can  be  mapped  to  two  types  of 
sequential  datapaths.  First,  the  tensor  product  Im  <g>  An  can  be  implemented  as  m  parallel  instances 
of  An  or  as  a  system  that  employs  streaming  reuse,  which  would  result  in  <  m  parallel  instances  of 
An  with  the  data  vector  streamed  through  the  system  over  multiple  cycles.  This  can  be  generalized 
to  allow  dependence  on  an  iteration  variable:  Im  <E>t  A™.  The  cost  of  this  implementation  depends 
on  the  degree  of  similarity  between  the  Ar'f  matrices. 

The  iterative  matrix  product  n /?=7) '  can  be  implemented  as  a  cascade  of  A  blocks  (m  deep) 
or  as  a  system  with  iterative  reuse,  where  the  data  vector  feeds  back  and  iterates  over  a  shallower 
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system  (<  m  deep).  Like  streaming  reuse,  iterative  reuse  allows  dependence  on  variable  £  but  this 
dependence  can  increase  the  system  cost.  Section  3.4  presents  an  example  system  that  uses  both 
streaming  and  iterative  reuse,  and  explores  the  effect  they  have  on  cost  and  performance. 

Chapter  3  defines  variants  of  (8 >  and  that  allow  a  formula  to  explicitly  specify  the  degree  of 
streaming  reuse  and  iterative  reuse  in  a  formula.  The  result  is  a  new  extended  formula  language 
that  allows  specification  of  an  algorithm  and  a  sequential  reuse  datapath  on  which  to  compute  it.  By 
formally  capturing  the  desired  freedoms  at  both  algorithm  and  datapath  levels,  this  language  enables 
automatic  compilation  and  exploration. 

Chapter  4  describes  the  automated  compilation  and  optimization  process.  First,  the  system 
takes  as  input  a  problem  (for  example,  DFT1024)  and  hardware  directives  that  specify  the  amount 
and  type  of  sequential  reuse  requested  (as  tags).  Then,  the  system  decomposes  the  base  transform 
matrix  using  one  or  a  combination  of  several  transform  algorithms,  making  choices  based  on  the 
hardware  directives  supplied.  Then,  the  algorithm  is  rewritten  (based  upon  the  supplied  tags)  into  a 
formula  in  a  hardware  specification  language  that  describes  both  algorithm  and  sequential  datapath. 
Lastly,  the  formula  is  passed  to  a  hardware  generation  stage  that  builds  the  corresponding  datapath 
in  synthesizable  register  transfer  level  Verilog. 

Chapter  5  addresses  an  important  challenge  that  arises  when  mapping  transform  algorithms  to 
sequential  datapaths:  streaming  permutations.  A  permutation  is  a  fixed  reordering  of  data  elements, 
and  a  streaming  permutation  is  one  where  the  n  data  elements  stream  in  and  out  of  the  system  at  a 
fixed  rate  of  w  words  per  cycle.  In  order  to  perform  a  permutation  on  streaming  data  it  is  necessary 
to  use  memory  to  reorder  the  position  of  data  within  the  stream  (reorder  “in  time”).  In  the  simplest 
case  where  w  =  1,  this  is  easy:  a  single  memory  with  one  write  port  and  one  read  port  is  sufficient. 
However,  as  w  grows,  the  use  of  multiported  memory  quickly  become  infeasible.  So,  it  is  necessary 
to  use  multiple  independent  memories.  However,  now  a  new  challenge  arises:  how  to  orchestrate 
the  data’s  writes  and  reads  in  and  out  of  memory  while  guaranteeing  that  no  port  conflicts  will 
occur.  Chapter  5  discusses  solutions  to  this  problem,  including  the  “bit  matrix”  method  [20]  and  the 
“general  streaming  permutation”  method  [19],  which  are  used  by  Spiral  in  different  situations. 

Next,  Chapter  6  revisits  transform  algorithms  in  the  context  of  the  sequential  reuse  hardware 
paradigm  considered  in  this  thesis.  It  discusses  algorithms  for  the  DFT  (of  two-power  and  non-two- 
power  size),  RDFT,  two-dimensional  DFT,  and  DCT.  For  each  algorithm,  it  explains  how  sequential 
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reuse  can  be  applied,  and  with  what  restrictions.  When  multiple  algorithms  can  be  applied,  it 
discusses  how  the  compiler’s  formula  generation  stage  chooses  between  them. 

Chapter  7  presents  results  from  FPGA  and  ASIC  synthesis  of  a  space  of  designs  generated  using 
Spiral.  For  each,  it  illustrates  the  space  of  cost/performance  tradeoffs  obtainable,  and  compares 
with  benchmarks  where  available.  This  chapter  demonstrates  designs  across  several  transforms 
(DFT,  2DDFT,  RDFT,  DCT2),  two  platforms  (65nm  standard  cell  ASIC  and  FPGA),  and  two  data 
types  (fixed  point  and  floating  point).  The  FPGA  results  show  performance  versus  area  (in  units 
of  FPGA  slices,  BRAMs,  and  DSP  slices).  The  ASIC  results  compare  performance  against  area 
and  power.  They  also  explore  the  effect  of  frequency  scaling  on  power,  area,  and  performance. 
When  targeting  a  fixed  performance  goal.  Spiral  is  able  to  generate  multiple  designs  that  meet  the 
goal  while  exhibiting  a  tradeoff  between  power  and  area:  larger  designs  can  be  clocked  at  lower 
frequencies,  reducing  power. 

Chapter  8  evaluates  Spiral-generated  designs  used  within  orthogonal  frequency-division  mul¬ 
tiplexing  (OFDM)  transceivers  for  optical  networks.  The  freedoms  in  hardware  and  algorithm  al¬ 
lowed  by  Spiral  are  exploited  to  determine  the  implementations  best  suited  for  this  challenging 
application. 

Lastly,  Chapter  9  discusses  related  work  in  the  literature,  including  work  on  hardware  imple¬ 
mentation  of  transforms,  high-level  synthesis,  and  permutations.  This  chapter  includes  comparisons 
between  the  most  commonly-seen  DFT  architectures  and  those  that  are  produced  using  the  methods 
described  in  this  thesis. 

10.2  Directions  for  Future  Work 

This  thesis  concludes  with  a  brief  discussion  of  some  possible  future  extensions  of  this  work. 

Interfacing  with  off-chip  memory.  The  architectures  considered  in  this  thesis  require  that  the 
entire  data  vector  reside  on-chip  at  once.  That  is,  the  memory  elements  in  buffers  and  streaming  per¬ 
mutations  must  store  the  entire  data  vector  using  on-chip  memories.  This  means  that  the  minimum 
area  of  a  design  produced  by  Spiral  depends  on  the  transform  size,  regardless  of  the  performance 
requirement. 

A  promising  direction  for  future  work  is  to  expand  the  generator  to  include  designs  that  interact 
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with  off-chip  memory.  This  would  require  changes  to  both  the  hardware  paradigm  and  the  types 
of  algorithms  considered,  but  would  use  the  existing  infrastructure  to  generate  the  lower-level  im¬ 
plementations.  This  type  of  memory  streaming  is  considered  in  [82,  83],  which  extends  Spiral  to 
generate  software  for  distributed  memory  architectures. 

Register-  and  FSM-based  basic  blocks.  The  sequential  hardware  paradigm  explored  in  this 
work  uses  dense  matrices  to  specify  computational  basic  blocks.  That  is,  each  processing  element 
has  an  associated  matrix  and  contains  an  arithmetic  unit  for  each  operation  implied  by  the  base 
matrix.  For  example,  consider  the  matrix 

1  1 
1  -1 

In  this  thesis,  this  block  would  always  be  implemented  with  one  adder  and  one  subtractor;  it  would 
have  a  minimum  streaming  width  of  w  =  2. 

However,  if  lower  throughput  such  as  w  =  1  is  desired,  another  implementation  could  be  pro¬ 
duced  that  uses:  one  adder/subtractor,  three  data  registers,  a  two-bit  counter,  and  two  control  mul¬ 
tiplexers.  Data  would  be  buffered  in  registers  until  it  is  consumed,  and  a  small  finite  state  machine 
(FSM)  would  provide  control.  Reducing  the  throughput  requirement  further  (for  example  one  word 
every  two  cycles)  could  lead  to  further  simplifications. 

In  this  way,  a  more  flexible  set  of  register-  and  FSM-based  datapaths  could  serve  as  compu¬ 
tational  elements.  Additionally,  this  type  of  technique  can  lead  to  less  dependence  on  algorithmic 
regularity  to  obtain  cost/performance  tradeoffs.  Further,  this  extension  may  be  particularly  useful  if 
this  work  were  extended  outside  the  domain  of  linear  transforms. 

Extension  to  other  kernels.  The  mathematical  language  used  in  this  thesis  represents  the  prob¬ 
lem  to  be  computed  as  a  matrix-vector  product.  So,  the  problems  that  can  be  considered  are  limited 
to  the  domain  of  linear  transforms.  However,  the  types  of  algorithmic  structure  and  dataflow  that 
this  thesis  exploits  appear  in  algorithms  for  other  types  of  problems.  Recent  work  [84]  introduces 
Operator  Language  (OL),  a  language  related  to  the  one  used  in  this  thesis,  which  can  represent 
non-linear  problems  while  still  mathematically  capturing  structure  within  the  algorithm.  [84]  uses 
OL  to  generate  software  for  Viterbi  decoders,  synthetic  aperture  radar  image  reconstruction,  sorting 
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networks,  matrix-matrix  multiplication,  and  circular  convolution.  The  system  in  this  thesis  may  be 
able  to  be  extended  to  support  a  subset  of  OL,  allowing  hardware  generation  for  a  wider  space  of 
problems. 

Improvements  to  streaming  permutation  structures.  This  thesis  uses  two  methods  for  de¬ 
signing  datapaths  to  perform  permutations  on  streaming  data:  the  bit  matrix  method  [20]  and  the 
general  method  [19].  As  explained  in  Chapter  5,  the  bit  matrix  method  generates  switching  net¬ 
works  and  control/addressing  logic  specifically  for  the  given  permutation.  The  general  streaming 
permutation  method  instead  uses  a  generic  all-to-all  network,  and  uses  lookup  tables  to  store  all  of 
the  necessary  control  and  address  values.  So,  it  is  trivial  to  use  the  general  method  to  construct  one 
structure  to  perform  multiple  permutations  (in  the  event  that  a  streaming  permutation  is  iteratively 
reused)  by  extending  the  lookup  tables  to  store  precomputed  constants  for  each  of  the  permutations 
desired.  However,  it  may  be  possible  to  extend  the  bit  matrix  method  for  the  same  purpose  at  lower 
cost. 

Another  possible  extension  is  an  improvement  to  the  general  streaming  permutation  method. 
When  generating  designs  using  this  technique,  a  matrix  representing  data  locations  is  decomposed 
into  a  sum  of  permutation  matrices,  as  shown  in  (5.3): 

T^w{Pn)  =  PoQo  +  filQl  +  '  '  ‘  +  Pk-lQk-l, 

where  nw(Pn )  is  a  w  x  w  semi-magic  square  (the  sum  of  each  of  its  rows  and  columns  is  n/w), 
each  of  the  Qi  are  permutation  matrices,  and  the  are  integers  >  0.  Given  a  tt w(Pn),  the  values  of 
Qi  and  /)t  are  not  unique:  the  matrix  can  be  decomposed  in  multiple  ways. 

Currently,  Spiral  performs  this  factorization  by  finding  any  permutation  contained  in  irw(Pn) 
and  subtracting  it  as  many  times  as  possible  (that  is,  finding  a  (3i  and  Q,j.  However,  the  overall  cost 
of  the  implementation’s  connection  network  (T  in  Figure  5.3)  is  related  to  the  set  of  permutations 
found  here.  In  the  worst  case,  a  full  generic  connection  network  must  be  implemented  to  support  all 
of  the  Qi,  consisting  of  w'  log^  w'  —  w'  +  1  switches,  where  w'  =  2  u,g'2  and  w  is  the  streaming 
width.  However,  if  nw  ( Pn )  can  be  factorized  using  a  subset  of  permutations,  a  simplified  connection 
network  with  approximately  half  the  cost  could  be  used  (such  as  those  in  [85,  86,  87]). 

Lastly,  both  streaming  permutation  methods  described  here  require  double-buffering.  This  dou- 
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bles  the  memory  requirement  of  each  implementation  so  the  system  can  begin  buffering  one  input 
vector  while  the  previous  vector  streams  out.  This  overhead  may  be  eliminated  for  some  permuta¬ 
tions  if  two  schedules  can  be  produced  which  overlap  such  that  schedule  1  reads  data  from  the  exact 
locations  that  schedule  2  is  about  to  write  to.  Both  permutation  methods  considered  in  this  thesis 
may  be  able  to  be  extended  to  support  this  for  a  subset  of  permutations. 

Hardware/software  co-design.  In  addition  to  the  hardware  generation  work  described  in  this 
thesis,  Spiral  also  generates  software  implementations  of  transforms  for  many  platforms  [4].  Thus, 
a  natural  direction  for  future  work  is  in  the  co-design  of  hardware  and  software.  Preliminary  work 
on  this  topic  has  been  presented  in  [88],  which  uses  a  library  of  FPGA  implementations  of  the  DFT 
generated  by  Spiral  (using  the  techniques  described  in  this  thesis)  as  accelerators  for  an  embedded 
processor.  Given  a  set  of  accelerators,  Spiral  searches  across  a  family  of  recursive  algorithms  in 
order  to  generate  a  library  of  implementations  that  use  both  hardware  and  software.  Extensions 
to  this  work  could  include  a  larger  search  space  of  possible  hardware  implementations  and  a  more 
sophisticated  family  of  parameterized  processors,  all  of  which  could  be  tuned  based  on  desired 
system  characteristics,  such  as  off-chip  memory  bandwidth  or  power  and  area  budgets. 

Resource  estimation.  Spiral  is  currently  able  to  provide  resource  counts  from  a  hardware  for¬ 
mula,  giving  the  number  and  type  of  arithmetic  and  memory  units.  However,  it  could  be  extended 
to  estimate  area  in  much  more  detail.  This  could  significantly  speed  up  the  exploration  process  by 
allowing  designers  to  find  the  points  that  best  fit  their  implementation  goals  more  easily.  Previous 
work  [21]  performed  detailed  area  estimation  for  designs  generated  using  the  Pease  FFT  with  radix 
2,  streaming  reuse  of  a  parameterizable  width,  and  iterative  reuse  of  depth  1  on  an  FPGA. 
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